Abstract

We study the problem of approximating and learning coverage functions. A function $c : 2^{[n]} \to \mathbb{R}^+$ is a coverage function if there exists a universe $U$ with non-negative weights $w(u)$ for each $u \in U$ and subsets $A_1, A_2, \ldots, A_n$ of $U$ such that $c(S) = \sum_{u \in \cup_{i \in S} A_i} w(u)$. Alternatively, coverage functions can be described as non-negative linear combinations of monotone disjunctions. They are a natural subclass of submodular functions and arise in a number of applications.

We show that over the uniform distribution coverage functions with range $[0, 1]$ are PAC learnable to $\ell_1$ error of $\epsilon$ in $\mathrm{poly}(n, 1/\epsilon)$ time and using $\mathrm{poly}(\log n, 1/\epsilon)$ random examples. We also show a proper learning algorithm for coverage functions whose running time is polynomial in the size of the universe over which the coverage function is defined. Our algorithm is based on several new structural properties of the Fourier spectrum of coverage functions and, in particular, we prove that any coverage function can be $\epsilon$-approximated in $\ell_1$ by a coverage function that depends only on $O(1/\epsilon^2)$ variables. In contrast, we show that, without assumptions on the distribution, learning coverage functions is at least as hard as learning polynomial-size disjoint DNF formulas, a class of functions for which the best known algorithm runs in time $2^{\tilde{O}(n^{1/3})}$ [KS04].

Our PAC learning algorithm on the uniform distribution implies the first polynomial time differentially private algorithm for releasing monotone disjunction queries with low average error over the uniform distribution on disjunctions. This problem was first considered by Gupta et al. [GHRU11] and the best previous algorithm runs in time $n^{O(\log(1/\alpha))}$ where $\alpha$ is the accuracy of release [CKKL12]. Further, our proper learning algorithm implies that the queries can be released using a synthetic database in time $\mathrm{poly}(n) \cdot \log(1/\alpha)^{O(\log(1/\alpha))}$.
1 Introduction

We consider learning and approximation of the class of coverage functions over the Boolean hypercube $\{-1, 1\}^n$. A function $c : 2^{[n]} \to \mathbb{R}^+$ is a coverage function if there exists a family of sets $A_1, A_2, \ldots, A_n$ on a universe $U$ equipped with a weight function $w : U \to \mathbb{R}^+$ such that for any $S \subseteq [n]$,

$$c(S) = w\Big(\bigcup_{i \in S} A_i\Big),$$

where $w$ is extended additively in the natural way such that $w(T) = \sum_{u \in T} w(u)$ for any $T \subseteq U$. We view these functions over $\{-1, 1\}^n$ by associating each subset $S \subseteq [n]$ with the vector $x^S \in \{-1, 1\}^n$ such that $x^S_i = -1$ iff $i \in S$. We define the size (denoted by size($c$)) of a coverage function $c$ as the size of a smallest-size universe $U$ that can be used to define $c$. As is well-known, coverage functions also have an equivalent and natural representation as non-negative linear combinations of monotone disjunctions with the size being the number of disjunctions in the combination.
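The two representations can be compared on a toy instance. The following is a minimal Python sketch, not code from the paper: the function names, the 0-based indexing of the sets $A_i$, and the example universe are all assumptions made for illustration. Each universe element $u$ contributes $w(u)$ times the disjunction over the indices $i$ with $u \in A_i$.

```python
from itertools import combinations

def coverage(sets, weights, S):
    """c(S) = total weight of the union of A_i over i in S."""
    covered = set()
    for i in S:
        covered |= sets[i]
    return sum(weights[u] for u in covered)

def as_disjunctions(sets, weights):
    """Map each index set {i : u in A_i} to the total weight of such elements u."""
    coeffs = {}
    for u, w in weights.items():
        idx = frozenset(i for i, A in enumerate(sets) if u in A)
        if idx:  # elements covered by no set never contribute
            coeffs[idx] = coeffs.get(idx, 0.0) + w
    return coeffs

def eval_disjunctions(coeffs, S):
    """Sum of the coefficients of all disjunctions that S satisfies."""
    S = set(S)
    return sum(a for idx, a in coeffs.items() if idx & S)

# Hypothetical toy instance: three sets over a four-element universe.
sets = [{"a", "b"}, {"b", "c"}, {"c", "d"}]
weights = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}
coeffs = as_disjunctions(sets, weights)
for r in range(4):
    for S in combinations(range(3), r):
        assert abs(coverage(sets, weights, S) - eval_disjunctions(coeffs, S)) < 1e-12
```

In this instance the number of disjunctions equals the universe size (four), matching the notion of size used above.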

Coverage functions form a relatively simple but important subclass of the broad class of submodular functions. Submodular functions have been studied in a number of contexts and play an important role in combinatorial optimization [Lov83, GW95, FFI01, Edm70, Fra97, Fei98] and a number of its applications to machine learning [GKS05, KGGK06, KG11] and in algorithmic game theory where they are used to model valuation functions [BLN06, DS06, Von08]. Coverage functions themselves figure in several applications such as facility location [CFN77], private data release of conjunctions [GHRU11] and algorithmic game theory where they are used to model the utilities of agents in welfare maximization and design of combinatorial auctions [DV11].

In this paper, we investigate the learnability of coverage functions from random examples. The study of learnability from random examples of the larger classes of functions such as submodular and fractionally-subadditive functions has been initiated by Balcan and Harvey [BH12] who were motivated by applications in algorithmic game theory. They introduced the PMAC model of learning in which, given random and independent examples of an unknown function, the learner is required to output a hypothesis that is multiplicatively close (which is the standard notion of approximation in the optimization setting) to the unknown target on at least $1 - \epsilon$ fraction of the points. This setting is also considered in [BCIW12, BDF+12]. Learning of submodular functions with less demanding (and more common in machine learning) additive guarantees was first considered by Gupta et al. [GHRU11], who were motivated by problems in private data release. Their setting allows a form of value queries¹ and the distribution is restricted to be uniform. In this setting the goal of the learner is equivalent to producing a hypothesis that $\epsilon$-approximates the target function in $\ell_1$ or $\ell_2$ norm (that is, $\mathbf{E}_{x \sim \mathcal{D}}[|f(x) - g(x)|]$ where $\mathcal{D}$ is the underlying distribution on the domain). The same notion of error and restriction to the uniform distribution are also used in several subsequent works on learning of submodular functions [CKKL12, RY12, FKV13] and are the focus of our work. For a more detailed survey of submodular function learning the reader is referred to [BH12].

Applications of coverage functions have also led to an increased interest in understanding their structural properties. Badanidiyuru et al. [BDF+12] study sketching of coverage functions and prove that there are small (polynomial in the dimension and the error parameter) approximate representations that multiplicatively approximate any given coverage function at all points within a given relative error. Their result implies an algorithm for learning coverage functions in the PMAC model [BH12]. However, their algorithm is not efficient and takes time exponential in the underlying dimension and error parameter. Chakrabarty and Huang [CH12] study the problem of testing coverage functions (under what they call the W-distance) and show that the class of coverage functions that have small representations can be reconstructed, that is, one can obtain in polynomial time a representation of an unknown coverage function $c$ of small size that computes $c$ correctly at all points, using polynomially many value queries. Their reconstruction algorithm can be seen as an exact learning algorithm with value queries.

1.1 Our Results 

Distribution-independent learning: Our main results are for the uniform distribution learning of coverage functions. However it is useful to first understand the complexity of learning these functions without any distributional assumptions (for a formal definition and details of the models of learning see Section 2). We prove that distribution-independent learning of coverage functions is at least as hard as PAC learning the class of polynomial-size disjoint DNF formulas over arbitrary distributions. Polynomial-size disjoint DNF formulas form an expressive class of Boolean functions that includes the class of polynomial-size decision trees, for example. Moreover, there is no known algorithm for learning polynomial-size disjoint DNFs that runs faster than the algorithm for learning general DNF formulas, the best known algorithm for which runs in time $2^{\tilde{O}(n^{1/3})}$ [KS04]. Let $\mathcal{CV}$ denote the class of coverage functions over $\{-1, 1\}^n$ with range in $[0, 1]$.

¹ A value query on a point in a domain returns the value of the target function at the point. For Boolean functions it is usually referred to as membership query.

Theorem 1.1. Let $\mathcal{A}$ be an algorithm that learns all coverage functions in $\mathcal{CV}$ of size at most $s$ with $\ell_1$-error $\epsilon$ in time $T(n, s, \frac{1}{\epsilon})$. There exists an algorithm $\mathcal{A}'$ that PAC learns the class of $s$-term disjoint-DNF in time $T(2n, s, \frac{2s}{\epsilon})$.

In Section 6.2 we also show that learning (both distribution specific and distribution independent and in both PAC and agnostic models) of coverage functions is at most as hard as learning the class of linear thresholds of monotone Boolean disjunctions. A special case of this simple reduction was also employed by Hardt et al. [HRS12] in the context of privately releasing conjunction queries that we discuss later.

Structural Results and Learning on the Uniform Distribution: Our learning algorithms on the 
uniform distribution (denoted by U) are based on several structural results about coverage functions. The 
key property that we exploit for PAC learning is that the Fourier coefficients of coverage functions have a 
form of (anti-)monotonicity property. 

Lemma 1.2. Let $c : \{-1, 1\}^n \to [0, 1]$ be a coverage function. For any non-empty $T \subseteq V \subseteq [n]$, $|\hat{c}(T)| \ge |\hat{c}(V)|$.

This lemma allows us to find all significant Fourier coefficients of a coverage function efficiently using a search procedure analogous to that in the Kushilevitz-Mansour algorithm [KM91] (but without the need for value queries). This contrasts strongly with Theorem 1.1 since disjoint DNF are not known to be learnable efficiently even under the uniform distribution (with the best known algorithm requiring $n^{O(\log s)}$ time).
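The search idea can be sketched as a level-wise enumeration: by the monotonicity property, a set can only have a large coefficient if all of its non-empty subsets do, so candidates at each level are built by extending surviving sets with surviving variables. The Python sketch below is an illustration under stated assumptions, not the paper's algorithm: it computes coefficients exactly by enumerating the whole cube (the actual algorithm estimates them from random examples), and the names `fourier_coefficient`, `significant_sets`, and the threshold `theta` are hypothetical.

```python
from itertools import product

def fourier_coefficient(f, n, T):
    """Exact coefficient f_hat(T) = E_x[f(x) * prod_{i in T} x_i] over {-1,1}^n."""
    total = 0.0
    for x in product((-1, 1), repeat=n):
        chi = 1
        for i in T:
            chi *= x[i]
        total += f(x) * chi
    return total / 2 ** n

def significant_sets(f, n, theta):
    """All non-empty T with |f_hat(T)| >= theta, assuming |f_hat| does not
    increase when T is replaced by a superset (Lemma 1.2)."""
    level = [frozenset([i]) for i in range(n)
             if abs(fourier_coefficient(f, n, [i])) >= theta]
    heavy_vars = sorted(i for T in level for i in T)
    found = list(level)
    while level:
        # Extend each surviving set by one surviving variable.
        candidates = {T | {i} for T in level for i in heavy_vars if i not in T}
        level = [T for T in candidates
                 if abs(fourier_coefficient(f, n, T)) >= theta]
        found.extend(level)
    return found

def or_s(S, x):
    """Monotone disjunction: 1 if some x_i with i in S equals -1 ("true")."""
    return 1.0 if any(x[i] == -1 for i in S) else 0.0

def c(x):
    # Hypothetical coverage function on 4 variables.
    return 0.5 * or_s((0, 1), x) + 0.5 * or_s((2,), x)

heavy = significant_sets(c, 4, 0.1)
print(sorted(sorted(T) for T in heavy))  # [[0], [0, 1], [1], [2]]
```

Variable 3 is discarded at the first level, so no set containing it is ever examined; this pruning is what keeps the number of examined sets small.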

We remark that in a very recent and independent work, Yang et al. [YBC13] develop a subroutine for
learning sums of monotone conjunctions using a variant of the same idea. Their application is in a very 
different context of learning DNF expressions from numerical pairwise queries, which given two assignments 
from {—1, 1}" to the variables, expects in reply, the number of terms of the target DNF satisfied by both 
assignments. 

To make our algorithms more efficient and make the dependence of the number of random examples on $n$ logarithmic (which implies attribute efficiency [BL97]) we prove that any coverage function can be approximated by a function of few variables (often referred to as a junta).

Theorem 1.3. Let $c : \{-1, 1\}^n \to [0, 1]$ be a coverage function and $\epsilon > 0$. There exists a coverage function $c' : \{-1, 1\}^n \to [0, 1]$ that depends only on $O(1/\epsilon^2)$ variables and satisfies $\mathbf{E}_{x \sim \mathcal{U}}[|c(x) - c'(x)|] \le \epsilon$.

We note that approximation by an exponentially larger $2^{O(1/\epsilon^2)}$-junta was recently proved by Feldman et al. [FKV13] for all submodular functions. It is not hard to see that $\Omega(1/\epsilon^2)$ variables are also necessary for $\epsilon$-approximating a coverage function (since they include monotone linear functions). This clearly distinguishes coverage functions from disjunctions themselves, which can always be approximated using a function of just $O(\log(1/\epsilon))$ variables.

Finally, our proper learning algorithms (both agnostic and PAC) rely on a simple approximation of any coverage function by a coverage function that includes only the disjunctions of length at most $\log(1/\epsilon)$ and the full disjunction. For $S \subseteq [n]$, we denote a monotone disjunction of variables with indices in $S$ by $\mathrm{OR}_S$.

Lemma 1.4. Let $c : \{-1, 1\}^n \to [0, 1]$ be a coverage function and $\epsilon > 0$. Then, for $k = \lceil \log(\frac{1}{\epsilon}) \rceil$, there exists a coverage function $c' : \{-1, 1\}^n \to [0, 1]$, $c' = \sum_{|S| \le k} \alpha'_S \cdot \mathrm{OR}_S$, such that $\mathbf{E}_{\mathcal{U}}[|c(x) - c'(x)|] \le \epsilon$.

We note that a similar (although not proper) form of approximation, by a linear combination of parities of degree at most $\log(1/\epsilon)$, is also used in the private data release algorithm in [CKKL12].



As a point of comparison, the sketching result of Badanidiyuru et al. [BDF+12] shows that for every coverage function, there is a coverage function with small size that approximates the function within $1 \pm \epsilon$ relative error at all points. The size of the approximating function is guaranteed to be polynomial in $n$ and $1/\epsilon$. However, there is no known algorithm to compute this strong approximation in subexponential time. In contrast, for our result, the succinctly represented approximating coverage function (with size that does not depend on $n$) can actually be computed efficiently using random examples alone.

Using the structural results above we obtain a polynomial-time and attribute-efficient PAC learning 
algorithm for coverage functions. 

Theorem 1.5. There exists an algorithm which, given $\epsilon > 0$ and access to random uniform examples of any $c \in \mathcal{CV}$, with probability at least 2/3, outputs a hypothesis $h$ such that $\mathbf{E}_{\mathcal{U}}[|h(x) - c(x)|] \le \epsilon$. The algorithm runs in $\tilde{O}(n/\epsilon^4 + 1/\epsilon^8)$ time and uses $\log n \cdot \tilde{O}(1/\epsilon^4)$ examples.

Exploiting the representation of coverage functions as non-negative linear combinations of monotone 
disjunctions, we show that we can actually get an algorithm that outputs a hypothesis that is guaranteed to 
be a coverage function. That is, the algorithm is proper. The running time of this algorithm is polynomial 
in n and, in addition, depends polynomially on the size of the target coverage function. 

Theorem 1.6. There exists an algorithm that, for any $\epsilon > 0$, given random and uniform examples of any $c \in \mathcal{CV}$, with probability at least 2/3, outputs a coverage function $h$ such that $\mathbf{E}_{\mathcal{U}}[|h(x) - c(x)|] \le \epsilon$. The algorithm runs in time $\tilde{O}(n) \cdot \mathrm{poly}(s/\epsilon)$ and uses $\log(n) \cdot \mathrm{poly}(s/\epsilon)$ random examples, where $s = \min\{\mathrm{size}(c), (1/\epsilon)^{\log(1/\epsilon)}\}$.

We note that for general submodular functions the best known algorithm has exponential dependence on $1/\epsilon$ and this is known to be necessary information-theoretically [FKV13].

We also briefly consider learning of coverage functions in the agnostic learning model [Hau92, KSS94] where coverage functions behave essentially like monotone disjunctions (see Section 4 for details). In particular, using Lemma 1.4 together with a simple algorithm based on linear regression with $\ell_1$ error we get a proper agnostic learning algorithm for coverage functions over the uniform distribution running in time $n^{O(\log(1/\epsilon))}$. It is not hard to show that this algorithm is essentially the best possible assuming hardness of learning sparse parities with noise.

1.2 Applications to Differentially Private Data Release 

As an application of our PAC learning algorithm, we obtain a polynomial time algorithm for the problem of privately releasing disjunction (or, equivalently, conjunction) queries on a database. We now briefly overview the problem and state our results. Formal definitions appear in Section 5 and a more detailed background discussion can, for example, be found in [TUV12]. The objective of a private data release algorithm is to release answers to all queries from a given class $C$ with low error while protecting the privacy of participants in the database. Specifically, we are given a database $D$ which is a subset of a fixed domain $X$ (in our case $X = \{-1, 1\}^n$). Given a query class $C$ of Boolean functions on $\{-1, 1\}^n$, the objective is to output a data structure $H$ that allows answering counting queries from $C$ on $D$ with low error. A counting query for $c \in C$ gives the fraction of elements in $D$ on which $c$ equals 1. The algorithm producing $H$ should be differentially private [DMNS06]. Informally speaking, the release algorithm is differentially private if adding an element of $X$ to (or removing an element of $X$ from) $D$ does not significantly affect the probability that any specific $H$ will be output by the algorithm. A natural and useful way of private data release for a database $D$ is to output another database $\hat{D} \subseteq X$ (in a differentially private way) such that answers to counting queries based on $\hat{D}$ approximate answers based on $D$. Such a database is referred to as a synthetic database. One of the advantages of release using a synthetic database is that a synthetic database can be used in the same software pipeline as the original database.

Boolean conjunction (equivalently, disjunction) queries form one of the simplest and most well-studied classes for the problem of private data release [BCD+07, UV11, CKKL12, HRS12, TUV12]. They are a part of the official statistics in the form of reported data in the US Census, Bureau of Labor Statistics and the Internal Revenue Service. Despite the relative simplicity of this class of functions, the best known algorithm for releasing all counting queries on conjunctions of size $k$ with a constant error takes time $n^{O(\sqrt{k})}$ (and requires a database of size at least $n^{O(\sqrt{k})}$) [TUV12].

Gupta et al. [GHRU11] define a relaxed version of the query release problem in which the error needs to be small on $1 - \beta$ fraction of queries from $C$ for a given parameter $\beta$ (or more generally on $1 - \beta$ fraction relative to some distribution on concepts in $C$). They relate this relaxed notion of private query release to PAC learnability of counting functions (that map a concept $c \in C$ to the value of the counting query for $c$ on $D$) using tolerant value query access to the function. They use this reduction together with an algorithm for learning submodular functions to give an algorithm for privately releasing monotone disjunction queries with error $\alpha$ that takes time $n^{O(\log(1/\beta)/\alpha^2)}$. Cheraghchi et al. give a simple algorithm that improves this dependence to $n^{O(\log(1/(\alpha\beta)))}$ [CKKL12].

Several lower bounds are known for the problem. Building on the work of Dwork et al. [DNR+09], Ullman and Vadhan [UV11] showed that there exists a constant $\gamma$ such that there is no polynomial time algorithm for releasing a synthetic database that answers all conjunction queries on sets of size 2 within an error of at most $\gamma$ under some mild cryptographic assumptions. Note that although the lower bound in [UV11] rules out a polynomial time algorithm for releasing private synthetic databases that answer all conjunction queries with a constant worst-case error, it does not rule out release with low error on a $1 - \beta$ fraction of all queries. Gupta et al. [GHRU11] show that releasing all queries from a concept class $C$ using counting queries (when accessing the database) requires as many counting queries as agnostically learning $C$ using statistical queries. Using lower bounds on statistical query complexity of agnostic learning of conjunctions by Feldman [Fel12] they derived a lower bound on counting query complexity for releasing all conjunctions of certain length.

As can be easily seen (e.g. [GHRU11, CKKL12]), the function mapping a monotone conjunction $c$ to a counting query for $c$ is a convex combination of monotone disjunctions, that is a coverage function. Using standard techniques we adapt our PAC learning algorithms for coverage functions to give two algorithms for privately releasing monotone disjunction queries. Our first algorithm uses Theorem 1.5 to obtain a differentially private algorithm for releasing monotone disjunction queries in time polynomial in $n$ (the database dimension), $1/\alpha$ and $1/\beta$. This is the first fully-polynomial differentially private algorithm that answers monotone disjunction queries with respect to the uniform distribution on the queries. For simplicity, instead of $\alpha$ and $\beta$ we bound the average error $\bar{\alpha}$ that directly corresponds to the $\ell_1$ error of our learning algorithms. Note that setting $\bar{\alpha} = \alpha \cdot \beta$ implies that with probability at least $1 - \beta$ the error is at most $\alpha$.
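The link between counting queries and coverage functions can be made concrete for monotone disjunction queries. In the Python sketch below (an illustration only; the encoding of database rows as sets of attribute indices and all function names are assumptions), the universe is the set of rows, every row has weight $1/|D|$, and $A_i$ is the set of rows that contain attribute $i$. The counting query for the disjunction over $S$ then equals the coverage value of $S$.

```python
def counting_query(db, S):
    """Fraction of rows that satisfy the monotone disjunction over attributes S."""
    S = set(S)
    return sum(1 for row in db if row & S) / len(db)

def as_coverage(db, n):
    """Universe = row positions, each of weight 1/|D|; A_i = rows having attribute i."""
    sets = [{r for r, row in enumerate(db) if i in row} for i in range(n)]
    weight = 1.0 / len(db)
    return sets, weight

def coverage_value(sets, weight, S):
    """Weight of the union of A_i over i in S."""
    covered = set()
    for i in S:
        covered |= sets[i]
    return weight * len(covered)

# Hypothetical database with 4 rows over 3 Boolean attributes.
db = [{0, 1}, {1}, {2}, set()]
sets, weight = as_coverage(db, 3)
for S in ([], [0], [1], [0, 2], [0, 1, 2]):
    assert counting_query(db, S) == coverage_value(sets, weight, S)
```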

Theorem 1.7. Let $C$ be the class of all monotone disjunctions. For every $\epsilon, \delta > 0$, there exists an $\epsilon$-differentially private algorithm which for any database $D \subseteq \{-1, 1\}^n$ of size $\tilde{\Omega}(n \log(1/\delta)/(\epsilon \bar{\alpha}^6))$, with probability at least $1 - \delta$, publishes a data structure $H$ that answers queries from $C$ with respect to the uniform distribution with an average error of at most $\bar{\alpha}$. The algorithm runs in time $\tilde{O}(n^2 \log(1/\delta)/(\epsilon \bar{\alpha}^{10}))$ and the
size of H . 



Our second algorithm uses Theorem 1.6 to obtain a differentially private algorithm that releases a synthetic database to answer monotone disjunction queries in time polynomial in $n$ and quasi-polynomial in $1/\bar{\alpha}$. Previous approaches to this problem [GHRU11, CKKL12, HRS12, TUV12] do not release a synthetic database (although some general exponential time algorithms such as the Multiplicative Weights in [HR10] do produce a synthetic database).

Theorem 1.8. Let $C$ be the class of all monotone disjunctions. For every $\epsilon, \delta > 0$, there exists an $\epsilon$-differentially private algorithm which for any database $D \subseteq \{-1, 1\}^n$ of size $n \cdot \bar{\alpha}^{-\Omega(\log(1/\bar{\alpha}))} \cdot \log(1/\delta)/\epsilon$, with probability at least $1 - \delta$, releases a synthetic database $\hat{D}$ that can answer queries from $C$ with respect to the uniform distribution with average error of at most $\bar{\alpha}$. The algorithm runs in time $n^2 \cdot \bar{\alpha}^{-O(\log(1/\bar{\alpha}))} \cdot \log(1/\delta)/\epsilon$.



Note that our algorithm is polynomial time for any error $\bar{\alpha}$ that is $2^{-O(\sqrt{\log n})}$. Finally we remark that our algorithms access the database using counting queries and therefore lower bounds from [GHRU11] apply to our setting. However the lower bound in [GHRU11] is only significant when the length of conjunctions is at most logarithmic in $n$. Such conjunctions form only an exponentially small fraction of all the conjunctions and therefore our algorithms do not violate this lower bound.

2 Preliminaries 

We use $\{-1, 1\}^n$ to denote the $n$-dimensional hypercube with "false" mapped to 1 and "true" mapped to $-1$. Let $[n]$ denote the set $\{1, 2, \ldots, n\}$. For $S \subseteq [n]$, we denote by $\mathrm{OR}_S : \{-1, 1\}^n \to \{0, 1\}$ the monotone Boolean disjunction on variables with indices in $S$, that is, for any $x \in \{-1, 1\}^n$, $\mathrm{OR}_S(x) = 0 \Leftrightarrow \forall i \in S,\ x_i = 1$.



A monotone Boolean disjunction is a simple example of a coverage function. To see this, consider a universe of size 1, containing a single element say $u$, the associated weight $w(u) = 1$, and the sets $A_1, A_2, \ldots, A_n$ such that $A_i$ contains $u$ if and only if $i \in S$. In the following lemma we describe a natural and folklore characterization of coverage functions as non-negative linear combinations of non-empty monotone disjunctions (e.g. [GHRU11]). For completeness we include the proof in the appendix.

Lemma 2.1. A function $c : \{-1, 1\}^n \to \mathbb{R}^+$ is a coverage function on some universe $U$ if and only if there exist non-negative coefficients $\alpha_S$ for every $S \subseteq [n]$, $S \neq \emptyset$, such that $c(x) = \sum_{S \subseteq [n], S \neq \emptyset} \alpha_S \cdot \mathrm{OR}_S(x)$, and at most $|U|$ of the coefficients $\alpha_S$ are non-zero.

For simplicity and without loss of generality we scale coverage functions to the range $[0, 1]$. Note that in this case, for $c = \sum_{S \subseteq [n], S \neq \emptyset} \alpha_S \cdot \mathrm{OR}_S$ we have $\sum_{S \subseteq [n], S \neq \emptyset} \alpha_S = c((-1, \ldots, -1)) \le 1$. In the discussion below we always represent coverage functions as linear combinations of monotone disjunctions with the sum of coefficients upper bounded by 1. For convenience, we also allow the empty disjunction (or constant 1) in the combination. Note that $\mathrm{OR}_{[n]}$ differs from the constant 1 only on one point $(1, 1, \ldots, 1)$ and therefore this more general definition is essentially equivalent for the purposes of our discussion. Finally, note that for every $S$, the coefficient $\alpha_S$ is determined uniquely by the function since $\mathrm{OR}_S$ is a monomial when viewed over $\{0, 1\}^n$ with 0 corresponding to "true".

2.1 Learning Models 

Our learning algorithms are in two standard models of learning: PAC learning [Val84] and agnostic learning [Hau92, KSS94]. For both models we use $\ell_1$ error which generalizes the disagreement error used when learning Boolean functions in these models. The PAC learning model assumes that the learner has access to random examples of an unknown function from a known class of functions.

Definition 2.2 (PAC learning with $\ell_1$-error). Let $\mathcal{F}$ be a class of real-valued functions on $\{-1, 1\}^n$ and let $\mathcal{D}$ be a distribution on $\{-1, 1\}^n$. An algorithm $\mathcal{A}$ PAC learns $\mathcal{F}$ on $\mathcal{D}$ if for every $\epsilon > 0$ and any target function $f \in \mathcal{F}$, given access to random independent samples from $\mathcal{D}$ labeled by $f$, with probability at least 2/3, $\mathcal{A}$ returns a hypothesis $h$ such that $\mathbf{E}_{x \sim \mathcal{D}}[|f(x) - h(x)|] \le \epsilon$. $\mathcal{A}$ is said to be proper if $h \in \mathcal{F}$. $\mathcal{A}$ is said to be efficient if $h$ can be evaluated in polynomial time on any input and the running time of $\mathcal{A}$ is polynomial in $n$ and $1/\epsilon$.

The agnostic learning model generalizes the definition of PAC learning to scenarios where one cannot assume that the input labels are consistent with a function from a given class [Hau92, KSS94] (for example, as a result of noise in the labels). For simplicity, we restrict the definition to learning of functions with range $[0, 1]$.

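Definition 2.3 (Agnostic learning with $\ell_1$ error). Let $\mathcal{F}$ be a class of real-valued functions on $\{-1, 1\}^n$ with range in $[0, 1]$ and let $\mathcal{D}$ be any fixed distribution on $\{-1, 1\}^n$. For any distribution $\mathcal{P}$ over $\{-1, 1\}^n \times [0, 1]$, let $\mathrm{opt}(\mathcal{P}, \mathcal{F})$ be defined as:

$$\mathrm{opt}(\mathcal{P}, \mathcal{F}) = \inf_{f \in \mathcal{F}} \mathbf{E}_{(x,y) \sim \mathcal{P}}[|y - f(x)|].$$

An algorithm $\mathcal{A}$ is said to agnostically learn $\mathcal{F}$ on $\mathcal{D}$ if for every $\epsilon > 0$ and any distribution $\mathcal{P}$ on $\{-1, 1\}^n \times [0, 1]$ such that the marginal of $\mathcal{P}$ on $\{-1, 1\}^n$ is $\mathcal{D}$, given access to random independent examples drawn from $\mathcal{P}$, with probability at least 2/3, $\mathcal{A}$ outputs a hypothesis $h$ such that

$$\mathbf{E}_{(x,y) \sim \mathcal{P}}[|h(x) - y|] \le \mathrm{opt}(\mathcal{P}, \mathcal{F}) + \epsilon.$$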

It is easy to see that given a set of $t$ examples $\{(x^i, y^i)\}_{i \le t}$ and a set of $m$ functions $\phi_1, \phi_2, \ldots, \phi_m$, finding coefficients $\alpha_1, \ldots, \alpha_m$ which minimize

$$\sum_{i \le t} \Big| \sum_{j \le m} \alpha_j \phi_j(x^i) - y^i \Big|$$

can be formulated as a linear program. This LP is referred to as Least-Absolute-Error (LAE) LP (or Least-Absolute-Deviation LP, or $\ell_1$ linear regression) [Wik]. Together with standard uniform convergence bounds for linear functions [Vap98], LAE LP gives a general technique for agnostic learning with $\ell_1$-error.
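One standard way to write this minimization as a linear program is sketched below. The slack variables $z_i$ are introduced here for illustration and are not notation from the paper: each $z_i$ upper-bounds the absolute residual on example $i$, and at an optimum it equals that residual.

```latex
\begin{aligned}
\text{minimize}\quad & \sum_{i \le t} z_i \\
\text{subject to}\quad
  & z_i \ge \sum_{j \le m} \alpha_j \phi_j(x^i) - y^i, && i = 1, \ldots, t, \\
  & z_i \ge y^i - \sum_{j \le m} \alpha_j \phi_j(x^i), && i = 1, \ldots, t.
\end{aligned}
```

The program has $m + t$ variables and $2t$ constraints, all linear in $(\alpha, z)$, so further linear constraints on the $\alpha_j$ can be appended without changing its form.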



Theorem 2.4 (Agnostic learning via LAE LP). Let $\mathcal{F}$ be a class of real-valued functions from $\{-1, 1\}^n$ to $[-B, B]$ for some $B > 0$, $\mathcal{D}$ be a distribution on $\{-1, 1\}^n$ and $\phi_1, \phi_2, \ldots, \phi_m : \{-1, 1\}^n \to [-B, B]$ be a set of functions that can be evaluated in time polynomial in $n$. Assume that there exists $\Delta$ such that for each $f \in \mathcal{F}$, there exist reals $\alpha_1, \alpha_2, \ldots, \alpha_m$ such that

$$\mathbf{E}_{x \sim \mathcal{D}}\Big[\Big|\sum_{i \le m} \alpha_i \phi_i(x) - f(x)\Big|\Big] \le \Delta.$$

Then there is an algorithm that for every $\epsilon > 0$, given access to random examples of $f \in \mathcal{F}$, with probability at least 2/3, outputs a function $h$ such that

$$\mathbf{E}[|h(x) - y|] \le \Delta + \epsilon.$$

The algorithm uses $\tilde{O}(m \cdot B^2/\epsilon^2)$ examples, runs in time polynomial in $n$, $m$, $B/\epsilon$ and returns a linear combination of $\phi_i$'s.

Remark 2.1. Additional linear constraints on $\alpha_i$'s can be added to this LP as long as these constraints are satisfied by linear combinations that approximate each $f \in \mathcal{F}$ within $\Delta$. In particular we will use this LP with each $\alpha_i$ being constrained to be non-negative and their sum being at most 1.

We remark that this approach to agnostic learning can equivalently be seen as learning based on Empirical Risk Minimization with absolute loss [Vap98]. For a Boolean target function, a hypothesis with $\ell_1$ error of $\epsilon$ also gives a hypothesis with classification error of $\epsilon$ (e.g. [KKMS08]). Therefore, as demonstrated by Kalai et al. [KKMS08], LAE LPs are also useful for agnostic learning of Boolean functions.

In the statistical query (SQ) learning model of Kearns [Kea98] the learner has access to a statistical query oracle instead of random examples of some $f$ over distribution $\mathcal{D}$. A statistical query (SQ) oracle is an oracle that for a given tolerance $\tau > 0$ and query function $\phi : \{-1, 1\}^n \times [0, 1] \to [-1, 1]$, returns some value $v$ such that $|v - \mathbf{E}_{x \sim \mathcal{D}}[\phi(x, f(x))]| \le \tau$.



2.2 Fourier Analysis on the Boolean Cube 

When learning with respect to the uniform distribution we use several standard tools and ideas from Fourier analysis on the Boolean hypercube. For any functions $f, g : \{-1, 1\}^n \to \mathbb{R}$, the inner product of $f$ and $g$ is defined as $\langle f, g \rangle = \mathbf{E}_{x \sim \mathcal{U}}[f(x) \cdot g(x)]$. The $\ell_1$ and $\ell_2$ norms of $f$ are defined by $\|f\|_1 = \mathbf{E}_{x \sim \mathcal{U}}[|f(x)|]$ and $\|f\|_2 = \sqrt{\mathbf{E}_{x \sim \mathcal{U}}[f(x)^2]}$ respectively. Unless noted otherwise, in this context all expectations are with respect to $x$ chosen from the uniform distribution.

For $S \subseteq [n]$, the parity function $\chi_S : \{-1, 1\}^n \to \{-1, 1\}$ is defined as $\chi_S(x) = \prod_{i \in S} x_i$. Parities form an orthonormal basis for functions on $\{-1, 1\}^n$ (for the inner product defined above). Thus, every function $f : \{-1, 1\}^n \to \mathbb{R}$ can be written as a real linear combination of parities. The coefficients of the linear combination are referred to as the Fourier coefficients of $f$. For $f : \{-1, 1\}^n \to \mathbb{R}$ and $S \subseteq [n]$, the Fourier coefficient $\hat{f}(S)$ is given by $\hat{f}(S) = \langle f, \chi_S \rangle = \mathbf{E}[f(x)\chi_S(x)]$.

The Fourier expansion of $f$ is given by $f(x) = \sum_{S \subseteq [n]} \hat{f}(S)\chi_S(x)$. For any function $f$ on $\{-1, 1\}^n$ its spectral $\ell_1$-norm is defined as $\|\hat{f}\|_1 = \sum_{S \subseteq [n]} |\hat{f}(S)|$. It is easy to estimate any Fourier coefficient of a function $f : \{-1, 1\}^n \to [0, 1]$, given access to an oracle that outputs the value of $f$ at a uniformly random point in the hypercube. Given any parameters $\epsilon, \delta > 0$, we choose a set $R \subseteq \{-1, 1\}^n$ of size $\Theta(\log(1/\delta)/\epsilon^2)$ drawn uniformly at random from $\{-1, 1\}^n$ and estimate $\tilde{f}(S) = \frac{1}{|R|} \sum_{x \in R} [f(x) \cdot \chi_S(x)]$. Standard Chernoff bounds can then be used to show that with probability at least $1 - \delta$, $|\hat{f}(S) - \tilde{f}(S)| \le \epsilon$.
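The estimator can be sketched in a few lines of Python. This is an illustrative sketch under assumptions: the function names are hypothetical, and the sample size uses an explicit Hoeffding constant, $m = \lceil 2\ln(2/\delta)/\epsilon^2 \rceil$, which is one valid instantiation of the $\Theta(\log(1/\delta)/\epsilon^2)$ bound for a product $f(x)\chi_S(x)$ taking values in $[-1, 1]$.

```python
import math
import random

def chi(S, x):
    """Parity chi_S(x) = product of x_i over i in S."""
    p = 1
    for i in S:
        p *= x[i]
    return p

def estimate_coefficient(f, n, S, eps, delta, rng):
    """Empirical mean of f(x) * chi_S(x) over uniform random points.

    Since f(x) * chi_S(x) lies in [-1, 1], Hoeffding's inequality gives error
    at most eps with probability >= 1 - delta for m = 2 ln(2/delta) / eps^2.
    """
    m = math.ceil(2 * math.log(2 / delta) / eps ** 2)
    total = 0.0
    for _ in range(m):
        x = [rng.choice((-1, 1)) for _ in range(n)]
        total += f(x) * chi(S, x)
    return total / m

def or_01(x):
    # Monotone disjunction on variables 0 and 1 (-1 encodes "true").
    return 1.0 if x[0] == -1 or x[1] == -1 else 0.0

rng = random.Random(0)
est = estimate_coefficient(or_01, 3, [0], eps=0.05, delta=1e-6, rng=rng)
# The exact coefficient of {0} is -1/4; the estimate should be within eps of it.
assert abs(est + 0.25) <= 0.05
```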

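Definition 2.5 (Spectral Concentration). For any $\epsilon > 0$, a Boolean function $f$ is said to be $\epsilon$-concentrated on a set $\mathbb{S} \subseteq 2^{[n]}$ of indices, if

$$\mathbf{E}\Big[\Big(f(x) - \sum_{S \in \mathbb{S}} \hat{f}(S)\chi_S(x)\Big)^2\Big] \le \epsilon.$$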

The following simple observation (implicit in [KM91]) can be used to obtain spectral concentration from bounded spectral $\ell_1$-norm for any function $f$.

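Lemma 2.6. Let $f : \{-1, 1\}^n \to \mathbb{R}$ be any function with $\|f\|_2 \le 1$. For any $\epsilon \in (0, 1]$, let $L = \|\hat{f}\|_1$ and $\mathbb{T} = \{T \mid |\hat{f}(T)| \ge \frac{\epsilon}{2L}\}$. Then $f$ is $\epsilon/2$-concentrated on $\mathbb{T}$ and $|\mathbb{T}| \le \frac{2L^2}{\epsilon}$. Further, let $\mathbb{S} \supseteq \mathbb{T}$ and for each $S \in \mathbb{S}$, let $\tilde{f}(S)$ be an estimate of $\hat{f}(S)$ such that

1. $\forall S \in \mathbb{S}$, $|\tilde{f}(S)| \ge \frac{\epsilon}{3L}$ and

2. $\forall S \in \mathbb{S}$, $|\tilde{f}(S) - \hat{f}(S)| \le \frac{\epsilon}{6L}$.

Then, $\mathbf{E}[(f(x) - \sum_{S \in \mathbb{S}} \tilde{f}(S) \cdot \chi_S(x))^2] \le \epsilon$ and, in particular, $\|f - \sum_{S \in \mathbb{S}} \tilde{f}(S) \cdot \chi_S\|_1 \le \sqrt{\epsilon}$.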

This lemma shows that approximating each large Fourier coefficient to a sufficiently small additive error yields a sparse linear combination of parities that approximates $f$. For completeness we include a proof in the appendix.

3 Learning Coverage Functions on the Uniform Distribution 

In this section we present two algorithms that PAC learn coverage functions from random examples. 

3.1 Structural Results 

We start by proving several structural lemmas about the Fourier spectrum of coverage functions. First, we observe that the spectral $\ell_1$-norm of coverage functions is bounded by the constant 2.

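Lemma 3.1. For a coverage function $c : \{-1, 1\}^n \to [0, 1]$, $\|\hat{c}\|_1 \le 2$.

Proof. From Lemma 2.1 we have that there exist non-negative coefficients $\alpha_S$ for every $S \subseteq [n]$ such that $c(x) = \sum_{S \subseteq [n]} \alpha_S \cdot \mathrm{OR}_S(x)$. By the triangle inequality, we have:

$$\|\hat{c}\|_1 \le \sum_{S \subseteq [n]} \alpha_S \cdot \|\widehat{\mathrm{OR}}_S\|_1 \le \max_{S \subseteq [n]} \|\widehat{\mathrm{OR}}_S\|_1 \cdot \sum_{S \subseteq [n]} \alpha_S \le \max_{S \subseteq [n]} \|\widehat{\mathrm{OR}}_S\|_1.$$

To complete the proof, we verify that $\forall S \subseteq [n]$, $\|\widehat{\mathrm{OR}}_S\|_1 \le 2$. For this note that

$$\mathrm{OR}_S(x) = 1 - \frac{1}{2^{|S|}} \prod_{i \in S} (1 + x_i) = 1 - \frac{1}{2^{|S|}} \sum_{T \subseteq S} \chi_T(x)$$

and thus $\|\widehat{\mathrm{OR}}_S\|_1 \le 1 + \frac{1}{2^{|S|}} \cdot 2^{|S|} = 2$. □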
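The bound is easy to check numerically on small instances. The following Python sketch (illustrative only; the helper names and the randomly generated coverage function are assumptions) computes the spectral $\ell_1$-norm by brute-force enumeration of $\{-1, 1\}^n$, which is feasible only for small $n$.

```python
from itertools import combinations, product
import random

def or_s(S, x):
    """Monotone disjunction: 1 if some x_i with i in S equals -1 ("true")."""
    return 1.0 if any(x[i] == -1 for i in S) else 0.0

def spectral_l1_norm(f, n):
    """Sum over all T of |f_hat(T)|, with each coefficient computed exactly."""
    points = list(product((-1, 1), repeat=n))
    norm = 0.0
    for r in range(n + 1):
        for T in combinations(range(n), r):
            coeff = 0.0
            for x in points:
                chi = 1
                for i in T:
                    chi *= x[i]
                coeff += f(x) * chi
            norm += abs(coeff / len(points))
    return norm

# A single disjunction on 2 variables has norm 2 - 2^(1-2) = 1.5.
assert abs(spectral_l1_norm(lambda x: or_s((0, 1), x), 3) - 1.5) < 1e-12

# Hypothetical random coverage function: convex combination of 5 disjunctions.
rng = random.Random(1)
n = 4
subsets = [frozenset(rng.sample(range(n), rng.randint(1, n))) for _ in range(5)]
raw = [rng.random() for _ in subsets]
alpha = [w / sum(raw) for w in raw]

def c(x):
    return sum(a * or_s(S, x) for a, S in zip(alpha, subsets))

assert spectral_l1_norm(c, n) <= 2 + 1e-9
```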

The small spectral $\ell_1$-norm guarantees that any coverage function has its Fourier spectrum $\epsilon^2$-concentrated on some set $\mathbb{T}$ of indices of size $O(\frac{1}{\epsilon^2})$ (Lemma 2.6). This means that given an efficient algorithm to find a set $\mathbb{S}$ of indices such that $\mathbb{S}$ is of size $O(\frac{1}{\epsilon^2})$ and $\mathbb{S} \supseteq \mathbb{T}$ we obtain a way to PAC learn coverage functions to $\ell_1$-error of $\epsilon$.

In general, given only random examples labeled by a function $f$ that is concentrated on a small set $\mathbb{T}$ of indices, it is not known how to efficiently find a small set $\mathbb{S} \supseteq \mathbb{T}$, without additional information about $\mathbb{T}$ (such as all indices in $\mathbb{T}$ being of small cardinality). However, for coverage functions, we can utilize a simple monotonicity property of their Fourier coefficients to efficiently retrieve such a set $\mathbb{S}$ and obtain a PAC learning algorithm with running time that depends only polynomially on $1/\epsilon$.

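Lemma 3.2 (Lemma 1.2 restated). Let $c : \{-1,1\}^n \to [0,1]$ be a coverage function. For any non-empty $T \subseteq V \subseteq [n]$, $|\hat{c}(V)| \le |\hat{c}(T)| \le \frac{1}{2^{|T|}}$.

Proof. From Lemma 2.1 we have that there exist constants $\alpha_S \ge 0$ for every $S \subseteq [n]$ such that $\sum_{S \subseteq [n]} \alpha_S \le 1$ and $c(x) = \sum_{S \subseteq [n]} \alpha_S \mathrm{OR}_S(x)$ for every $x \in \{-1,1\}^n$. The Fourier transform of $c$ can now be obtained simply by observing, as before in Lemma 3.1, that $\mathrm{OR}_S(x) = 1 - \frac{1}{2^{|S|}} \sum_{T \subseteq S} \chi_T(x)$. Thus for every $T \neq \emptyset$,

$$\hat{c}(T) = -\sum_{S \supseteq T} \alpha_S \cdot \frac{1}{2^{|S|}} .$$

Notice that since all the coefficients $\alpha_S$ are non-negative, $\hat{c}(T)$ and $\hat{c}(V)$ are non-positive and

$$|\hat{c}(T)| = \sum_{S \supseteq T} \alpha_S \cdot \frac{1}{2^{|S|}} \ge \sum_{S \supseteq V} \alpha_S \cdot \frac{1}{2^{|S|}} = |\hat{c}(V)| .$$

For an upper bound on the magnitude $|\hat{c}(T)|$, we have:

$$|\hat{c}(T)| = \sum_{S \supseteq T} \alpha_S \cdot \frac{1}{2^{|S|}} \le \sum_{S \supseteq T} \alpha_S \cdot \frac{1}{2^{|T|}} \le \Big( \sum_{S \subseteq [n]} \alpha_S \Big) \cdot \frac{1}{2^{|T|}} \le \frac{1}{2^{|T|}} .$$

□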
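
The closed form for $\hat{c}(T)$ in the proof above is easy to check numerically. The sketch below draws a small random coverage function (the seed, the dimension $n = 5$, the number of disjunctions, and all helper names are illustrative assumptions) and verifies the three properties the proof establishes: non-positivity, monotonicity under set inclusion, and the $2^{-|T|}$ upper bound.

```python
from itertools import combinations
import random

def subsets(n):
    """All subsets of {0, ..., n-1} as frozensets."""
    for k in range(n + 1):
        for S in combinations(range(n), k):
            yield frozenset(S)

def coverage_fourier(alpha, n):
    """Closed form from the proof: c_hat(T) = -sum_{S >= T} alpha_S / 2^{|S|}, T non-empty."""
    return {T: -sum(a / 2 ** len(S) for S, a in alpha.items() if T <= S)
            for T in subsets(n) if T}

random.seed(0)
n = 5
nonempty = [S for S in subsets(n) if S]
chosen = random.sample(nonempty, 6)
raw = [random.random() for _ in chosen]
alpha = {S: w / sum(raw) for S, w in zip(chosen, raw)}  # non-negative, sums to 1

c_hat = coverage_fourier(alpha, n)
# Monotonicity: T subset of V implies |c_hat(V)| <= |c_hat(T)|.
monotone = all(abs(c_hat[V]) <= abs(c_hat[T]) + 1e-12
               for T in c_hat for V in c_hat if T <= V)
# Upper bound: |c_hat(T)| <= 2^{-|T|}.
bounded = all(abs(c_hat[T]) <= 2 ** -len(T) + 1e-12 for T in c_hat)
```

Both flags come out true for any non-negative weights summing to at most 1, since the argument in the proof does not depend on which disjunctions appear.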

We will now use Lemmas 3.1 and 3.2 to show that for any coverage function $c$, there exists another coverage function $c'$ that depends on just $O(1/\epsilon^2)$ variables and $\ell_1$-approximates it within $\epsilon$. Using Lemma 3.1 we also obtain spectral concentration for $c$.

We start with some notation: for any $x \in \{-1,1\}^n$ and a subset $J \subseteq [n]$ of variables, let $x_J \in \{-1,1\}^J$ denote the projection of $x$ on $J$. Given $y \in \{-1,1\}^J$ and $z \in \{-1,1\}^{\bar{J}}$, let $x = y \circ z$ denote the string in $\{-1,1\}^n$ such that $x_J = y$ and $x_{\bar{J}} = z$ (where $\bar{J}$ denotes the set $[n] \setminus J$).

We will need the following simple lemma that expresses the Fourier coefficients of the function $f_I$ which is obtained by averaging a function $f$ over all variables outside of $I$ (a proof can be found for example in [KM93]).

Lemma 3.3. For $f : \{-1,1\}^n \to [0,1]$ and a subset $I \subseteq [n]$, let $f_I : \{-1,1\}^n \to [0,1]$ be defined by $f_I(x) = \mathbf{E}_{y \sim \{-1,1\}^{\bar{I}}}[f(x_I \circ y)]$. Then, $\hat{f_I}(S) = \hat{f}(S)$ for every $S \subseteq I$ and $\hat{f_I}(T) = 0$ for every $T \not\subseteq I$.

We now show that coverage functions can be approximated by functions of few variables. 

Theorem 3.4 (Theorem 1.3 restated). Let $c : \{-1,1\}^n \to [0,1]$ be a coverage function and $\epsilon > 0$. Let $I = \{i \in [n] \mid |\hat{c}(\{i\})| \ge \epsilon^2/2\}$. Let $c_I$ be defined as $c_I(x) = \mathbf{E}_{y \sim \{-1,1\}^{\bar{I}}}[c(x_I \circ y)]$. Then $c_I$ is a coverage function that depends only on variables in $I$, $|I| \le 4/\epsilon^2$, $\mathrm{size}(c_I) \le \mathrm{size}(c)$ and $\|c - c_I\|_1 \le \epsilon$. Further, let $\mathcal{T} = \{T \subseteq [n] \mid |\hat{c}(T)| \ge \epsilon^2/2\}$. Then $\mathcal{T} \subseteq 2^I$ and $c$ is $\epsilon^2$-concentrated on $\mathcal{T}$.

Proof. Since $c$ is a coverage function, it can be written as a non-negative weighted sum of monotone disjunctions. Thus, for every $v \in \{-1,1\}^{\bar{I}}$ the function $c_v : \{-1,1\}^n \to [0,1]$ defined as $c_v(z \circ y) = c(z \circ v)$ for every $y \in \{-1,1\}^{\bar{I}}$ is also a non-negative linear combination of monotone disjunctions, that is, a coverage function. By definition, for every $z \in \{-1,1\}^{I}$ and $y \in \{-1,1\}^{\bar{I}}$,

$$c_I(z \circ y) = \frac{1}{2^{|\bar{I}|}} \sum_{v \in \{-1,1\}^{\bar{I}}} c(z \circ v) = \frac{1}{2^{|\bar{I}|}} \sum_{v \in \{-1,1\}^{\bar{I}}} c_v(z \circ y) .$$

In other words, $c_I$ is a convex combination of $c_v$'s and therefore is a coverage function itself. Note that for every $S \subseteq I$, if the coefficient of $\mathrm{OR}_S$ in $c_I$ is non-zero then there must exist $S' \subseteq \bar{I}$ for which the coefficient of $\mathrm{OR}_{S \cup S'}$ in $c$ is non-zero. This implies that $\mathrm{size}(c_I) \le \mathrm{size}(c)$.

We will now establish that $c_I$ approximates $c$. Using Lemma 3.3, $\hat{c}(S) = \hat{c_I}(S)$ for every $S \subseteq I$. Thus, $\|c - c_I\|_2^2 = \sum_{T \not\subseteq I} \hat{c}(T)^2$. We first observe that $\mathcal{T} \subseteq 2^I$. To see this, consider any $T \not\subseteq I$. Then, $\exists j \notin I$ such that $j \in T$ and therefore, by Lemma 3.2, $|\hat{c}(T)| \le |\hat{c}(\{j\})| < \epsilon^2/2$. Thus $|\hat{c}(T)| < \epsilon^2/2$.

By Lemma 2.6, $c$ is $\epsilon^2$-concentrated on $\mathcal{T}$ and using Cauchy-Schwarz inequality,

$$\|c - c_I\|_1^2 \le \|c - c_I\|_2^2 = \sum_{T \not\subseteq I} \hat{c}(T)^2 \le \sum_{T \notin \mathcal{T}} \hat{c}(T)^2 \le \epsilon^2 .$$

□
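
To make the averaging construction of Theorem 3.4 concrete, here is a minimal brute-force sketch. The particular coverage function `alpha`, the values $n = 6$ and $\epsilon = 0.3$, and all helper names are assumptions chosen for illustration; the singleton coefficients use the closed form from the proof of Lemma 3.2.

```python
from itertools import product

def or_s(S, x):
    """Monotone disjunction; x_i = -1 encodes 'true'."""
    return 1.0 if any(x[i] == -1 for i in S) else 0.0

def coverage(alpha, x):
    """c(x) = sum_S alpha_S * OR_S(x)."""
    return sum(a * or_s(S, x) for S, a in alpha.items())

def singleton_coeff(alpha, i):
    """c_hat({i}) = -sum_{S containing i} alpha_S / 2^{|S|}."""
    return -sum(a / 2 ** len(S) for S, a in alpha.items() if i in S)

def averaged(alpha, I, n, x):
    """c_I(x): average of c over all settings of the coordinates outside I."""
    outside = [j for j in range(n) if j not in I]
    total = 0.0
    for y in product((-1, 1), repeat=len(outside)):
        z = list(x)
        for j, v in zip(outside, y):
            z[j] = v
        total += coverage(alpha, z)
    return total / 2 ** len(outside)

n, eps = 6, 0.3
alpha = {(0,): 0.5, (1,): 0.3, (2, 3, 4, 5): 0.2}   # hypothetical example
I = {i for i in range(n) if abs(singleton_coeff(alpha, i)) >= eps ** 2 / 2}
err = sum(abs(coverage(alpha, x) - averaged(alpha, I, n, x))
          for x in product((-1, 1), repeat=n)) / 2 ** n   # ||c - c_I||_1
```

In this instance only coordinates 0 and 1 have singleton coefficients above the threshold, so the long disjunction is replaced by its mean and the resulting $\ell_1$ error stays well below $\epsilon$.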



3.2 PAC Learning Algorithm 

We now describe our PAC learning algorithm for the class of coverage functions. Using random examples of the target coverage function, we estimate all the singleton Fourier coefficients and isolate the set $\tilde{I}$ of coordinates corresponding to large (estimated) singleton coefficients, which includes $I = \{i \in [n] \mid |\hat{c}(\{i\})| \ge \epsilon^2/4\}$. Theorem 3.4 guarantees that the target coverage function is concentrated on the large Fourier coefficients, the indices of which are subsets of $\tilde{I}$. We then find a collection $\mathcal{S} \subseteq 2^{\tilde{I}}$ of indices that contains all $T \subseteq \tilde{I}$ such that $|\hat{c}(T)| \ge \epsilon^2/4$. This can be done efficiently since by Lemma 3.2, $|\hat{c}(T)| \ge \epsilon^2/4$ only if $|\hat{c}(V)| \ge \epsilon^2/4$ for all $V \subseteq T$, $V \neq \emptyset$.

We can only estimate Fourier coefficients up to some additive error with high probability and therefore we keep all coefficients in the set $\mathcal{S}$ whose estimated magnitude is at least $\epsilon^2/6$. Once we have a set $\mathcal{S}$ on which the target function is $\epsilon^2$-concentrated, we use Lemma 2.6 to get our hypothesis. We give the pseudocode of the algorithm below.

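Algorithm 1 PAC Learning of Coverage Functions

1: Set $\theta = \epsilon^2/6$.
2: Draw a random sample of size $m_1 = O(\log(n)/\epsilon^4)$ and use it to estimate $\hat{c}(\{i\})$ for all $i$.
3: Set $\tilde{I} = \{i \in [n] \mid |\tilde{c}(\{i\})| \ge \theta\}$.
4: $\mathcal{S} \leftarrow \{\emptyset\}$.
5: Draw random sample $R$ of size $m_2 = O(\log(1/\epsilon)/\epsilon^4)$.
6: for $t = 1$ to $\log(2/\theta)$ do
7:   for each set $T \in \mathcal{S}$ of size $t - 1$ and $i \in \tilde{I} \setminus T$ do
8:     Use $R$ to estimate the coefficient $\hat{c}(T \cup \{i\})$.
9:     If $|\tilde{c}(T \cup \{i\})| \ge \theta$ then $\mathcal{S} \leftarrow \mathcal{S} \cup \{T \cup \{i\}\}$
10:   end for
11: end for
12: return $\sum_{S \in \mathcal{S}} \tilde{c}(S) \cdot \chi_S$.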
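
The control flow of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function `learn_coverage` takes a hypothetical callable `estimate(T)` in place of the sample-based estimates $\tilde{c}(T)$, logarithms are taken base 2, and the example plugs in exact coefficients (via the closed form from Lemma 3.2) so that the run is deterministic.

```python
from itertools import product
import math

def exact_coefficient(alpha, T):
    """Exact Fourier coefficient of c = sum_S alpha_S * OR_S at the index set T."""
    if not T:
        return sum(a * (1 - 2.0 ** -len(S)) for S, a in alpha.items())
    return -sum(a * 2.0 ** -len(S) for S, a in alpha.items() if T <= S)

def learn_coverage(estimate, n, eps):
    """Sketch of Algorithm 1; estimate(T) stands in for the estimate of c_hat(T)."""
    theta = eps ** 2 / 6
    I = [i for i in range(n) if abs(estimate(frozenset({i}))) >= theta]
    hyp = {frozenset(): estimate(frozenset())}   # collection S starts as {emptyset}
    level = [frozenset()]                        # sets added in the previous iteration
    for _ in range(math.ceil(math.log2(2 / theta))):
        new_level = []
        for T in level:
            for i in I:
                U = T | {i}
                if i in T or U in hyp:
                    continue
                value = estimate(U)
                if abs(value) >= theta:          # keep only large estimated coefficients
                    hyp[U] = value
                    new_level.append(U)
        if not new_level:
            break
        level = new_level
    return hyp                                   # maps S to the coefficient of chi_S

def evaluate(hyp, x):
    """Hypothesis h(x) = sum_S hyp[S] * chi_S(x)."""
    return sum(v * math.prod(x[i] for i in S) for S, v in hyp.items())

def coverage(alpha, x):
    return sum(a for S, a in alpha.items() if any(x[i] == -1 for i in S))

n, eps = 6, 0.4
alpha = {frozenset({0}): 0.4, frozenset({1, 2}): 0.3, frozenset({2, 3, 4, 5}): 0.3}
hyp = learn_coverage(lambda T: exact_coefficient(alpha, T), n, eps)
l1_error = sum(abs(coverage(alpha, x) - evaluate(hyp, x))
               for x in product((-1, 1), repeat=n)) / 2 ** n
```

In the learning setting, `estimate` would instead average $y \cdot \chi_T(x)$ over the labeled sample, and the guarantee would hold only with the success probability stated in Theorem 3.5.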

Theorem 3.5 (Theorem 1.5 restated). There exists an algorithm that PAC learns CV in $\tilde{O}(n/\epsilon^4 + 1/\epsilon^8)$ time and using $\log n \cdot \tilde{O}(1/\epsilon^4)$ examples.

Proof. Let $c$ be the target coverage function and let $\mathcal{T} = \{T \subseteq [n] \mid |\hat{c}(T)| \ge \epsilon^2/4\}$. By Lemma 2.6 it is sufficient to find a set $\mathcal{S} \supseteq \mathcal{T}$ and estimates $\tilde{c}(S)$ for each $S \in \mathcal{S}$ such that:

1. $\forall S \in \mathcal{S}$, $|\tilde{c}(S)| \ge \epsilon^2/6$ and

2. $\forall S \in \mathcal{S}$, $|\tilde{c}(S) - \hat{c}(S)| \le \epsilon^2/12$.

Let $\theta = \epsilon^2/6$. In the first stage our algorithm finds a set $\tilde{I}$ of variables that contains $I = \{i \in [n] \mid |\hat{c}(\{i\})| \ge \epsilon^2/4\}$. We do this by estimating all the singleton Fourier coefficients $\{\hat{c}(\{i\}) \mid i \in [n]\}$ within $\theta/2$ with (overall) probability at least 5/6 (as before we denote the estimate of $\hat{c}(S)$ by $\tilde{c}(S)$). We set $\tilde{I} = \{i \in [n] \mid |\tilde{c}(\{i\})| \ge \theta\}$. If all the estimates are within $\theta/2$ of the corresponding coefficients then for every $i \in I$, $|\tilde{c}(\{i\})| \ge \epsilon^2/4 - \theta/2 = \epsilon^2/6 = \theta$. Therefore $i \in \tilde{I}$ and hence $I \subseteq \tilde{I}$.

In the second phase, the algorithm finds a set $\mathcal{S} \subseteq 2^{\tilde{I}}$ such that the set of all large Fourier coefficients $\mathcal{T}$ is included in $\mathcal{S}$. This is done iteratively starting with $\mathcal{S} = \{\emptyset\}$. In every iteration, for every set $T$ that was added in the previous iteration and every $i \in \tilde{I} \setminus T$, it estimates $\hat{c}(T \cup \{i\})$ within $\theta/2$ (the success probability for estimates in this whole phase will be 5/6). If $|\tilde{c}(T \cup \{i\})| \ge \theta$ then $T \cup \{i\}$ is added to $\mathcal{S}$. This iterative process runs until no sets are added in an iteration. At the end of the last iteration, the algorithm returns $\sum_{S \in \mathcal{S}} \tilde{c}(S) \chi_S$ as the hypothesis.

We first prove the correctness of the algorithm assuming that all the estimates are successful. Let $T \in \mathcal{T}$ be such that $|\hat{c}(T)| \ge \epsilon^2/4$. Then, by Theorem 3.4, $T \subseteq I \subseteq \tilde{I}$. In addition, by Lemma 3.2, for all $V \subseteq T$, $V \neq \emptyset$, $|\hat{c}(V)| \ge \epsilon^2/4$. This means that for all $V \subseteq T$, $V \neq \emptyset$, an estimate of $\hat{c}(V)$ within $\theta/2$ will have magnitude at least $\theta$. By induction on $t$ this implies that in iteration $t$, all subsets of $T$ of size $t$ will be added to $\mathcal{S}$ and $T$ will be added in iteration $|T|$. Hence the algorithm outputs a set $\mathcal{S}$ such that $\mathcal{T} \subseteq \mathcal{S}$. By definition, $\forall S \in \mathcal{S}$, $|\tilde{c}(S)| \ge \theta = \epsilon^2/6$ and $\forall S \in \mathcal{S}$, $|\tilde{c}(S) - \hat{c}(S)| \le \theta/2 = \epsilon^2/12$. By Lemma 2.6, $\|c - \sum_{S \in \mathcal{S}} \tilde{c}(S) \chi_S\|_1 \le \epsilon$.

We now analyze the running time and sample complexity of the algorithm. We make the following observations regarding the algorithm.

• By Chernoff bounds, $O(\log(n)/\theta^2) = O(\log(n)/\epsilon^4)$ examples suffice to estimate all singleton coefficients within $\theta/2$ with probability at least 5/6. To estimate a singleton coefficient of $c$, the algorithm needs to look at only one coordinate and the label of a random example. Thus all the singleton coefficients can be estimated in time $O(n \log(n)/\epsilon^4)$.

• For every $S$ such that $\hat{c}(S)$ was estimated within $\theta/2$ and $|\tilde{c}(S)| \ge \theta$, we have that $|\hat{c}(S)| \ge \theta/2 = \epsilon^2/12$. This implies that $|\tilde{I}| \le 2/(\theta/2) = 24/\epsilon^2$. This also implies that $|\mathcal{S}| \le 4/\theta = 24/\epsilon^2$.

• By Lemma 3.2, for any $T \subseteq [n]$, $|\hat{c}(T)| \le \frac{1}{2^{|T|}}$. Thus, if $|\hat{c}(T)| \ge \theta/2$ then $|T| \le \log(2/\theta)$. This means that the number of iterations in the second phase is bounded by $\log(2/\theta)$ and for all $S \in \mathcal{S}$, $|S| \le \log(2/\theta)$.

• In the second phase, the algorithm only estimates coefficients for subsets in

$$\mathcal{S}' = \{S \cup \{i\} \mid |\tilde{c}(S)| \ge \theta \text{ and } i \in \tilde{I}\} .$$

Let $\mathcal{T}' = \{T \cup \{i\} \mid |\hat{c}(T)| \ge \theta/2 \text{ and } i \in \tilde{I}\}$. By Chernoff bounds, a random sample of size $O(\log|\mathcal{T}'|/\theta^2) = \tilde{O}(1/\epsilon^4)$ can be used to ensure that, with probability at least 5/6, the estimates of all coefficients on subsets in $\mathcal{T}'$ are within $\theta/2$. When the estimates are successful we also know that $\mathcal{S}' \subseteq \mathcal{T}'$ and therefore all coefficients estimated by the algorithm in the second phase are also within $\theta/2$ of true values with probability $\ge 5/6$. Overall in the second phase the algorithm estimates $|\mathcal{S}'| \le |\mathcal{S}| \cdot |\tilde{I}| = O(1/\epsilon^4)$ coefficients. To estimate any single one of those coefficients, the algorithm needs to examine only $\log(2/\theta) = O(\log(1/\epsilon))$ coordinates and the label of an example. Thus, the estimation of each Fourier coefficient takes $\tilde{O}(1/\epsilon^4)$ time and $\tilde{O}(1/\epsilon^8)$ time is sufficient to estimate all the coefficients.

Thus, in total the algorithm runs in $\tilde{O}(n/\epsilon^4 + 1/\epsilon^8)$ time, uses $\log n \cdot \tilde{O}(1/\epsilon^4)$ random examples and succeeds with probability at least 2/3. □

It can also be easily seen from our analysis that the necessary estimates of Fourier coefficients can be obtained using statistical queries of tolerance $\theta/2 = \epsilon^2/12$.

Corollary 3.6. There exists an algorithm that PAC learns CV using $O(n + 1/\epsilon^4)$ statistical queries of tolerance $\Omega(\epsilon^2)$ and running in time $\tilde{O}(n + 1/\epsilon^4)$.

3.3 Proper PAC Learning 

The algorithm in the previous section runs in time $\mathrm{poly}(n, 1/\epsilon)$ for any coverage function, regardless of its size, but the output hypothesis need not be a coverage function. We now present a PAC learning algorithm for coverage functions that guarantees that the returned hypothesis is also a coverage function. That is, the algorithm is proper. The running time of the algorithm will depend polynomially on the size of the target coverage function. Invoking Thm. 3.4 again, we will essentially be able to assume that the size of any coverage function is upper bounded by $(1/\epsilon)^{O(\log(1/\epsilon))}$.

The algorithm we present will output a non-negative linear combination of disjunctions, where the sum of the weights of the linear combination is at most 1. The basic idea of the algorithm is to find a small set $\mathcal{S}$ of indices for which there exist non-negative reals $\alpha_S$ with $\sum_{S \in \mathcal{S}} \alpha_S \le 1$ such that $\mathbf{E}[|c(x) - \sum_{S \in \mathcal{S}} \alpha_S \cdot \mathrm{OR}_S(x)|] \le \epsilon$. Given $\mathcal{S}$, we can find a good non-negative linear combination as above using LAE-LP described in Thm. 2.4 (and Remark 2.1). We will show that for any coverage function $c$, there exists a set of indices $\mathcal{S}$ (that depends on $c$) as above and moreover, that we can find such a set $\mathcal{S}$ using just random examples labeled by $c$. This will give us our proper learning algorithm.

To find a set $\mathcal{S}$ as above, we use the Fourier coefficients of the target coverage function. First, we will invoke Thm. 3.4 to find a set $\tilde{I} \subseteq [n]$ of coordinates of size $O(1/\epsilon^2)$ just as in Thm. 3.5 and show that we need only look at subsets of $\tilde{I}$ to find $\mathcal{S}$. Following this, we will prove that $\mathcal{S}$ can be identified using the following property: if $S \in \mathcal{S}$, then the Fourier coefficient $\hat{c}(S)$ is large enough in magnitude.

We begin with a simple lemma that shows that if a disjunction has a small number of variables and a 
significant coefficient in a coverage function then the corresponding (i.e. with the same index set) Fourier 
coefficient is significant. 

Lemma 3.7. Let $c = \sum_{S \subseteq [n]} \alpha_S \mathrm{OR}_S$ be a coverage function. Then, for any $T \subseteq [n]$, $|\hat{c}(T)| \ge 2^{-|T|} \cdot \alpha_T$.

Proof. From the proof of Lemma 3.2 we have for any non-empty $T$, $\hat{c}(T) = -\sum_{S \supseteq T} \alpha_S \cdot \frac{1}{2^{|S|}}$. Each term in the summation above is non-negative. Thus, $|\hat{c}(T)| \ge 2^{-|T|} \cdot \alpha_T$. □

We now show that for a coverage function $c \in$ CV, using significant Fourier coefficients of $c$, we can identify a coverage function $c'$ of size at most $\min\{\mathrm{size}(c), (1/\epsilon)^{O(\log(1/\epsilon))}\}$ that $\ell_1$-approximates $c$ within $\epsilon$. Moreover, $c'$ depends on just $O(1/\epsilon^2)$ variables.

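Lemma 3.8. Let $c : \{-1,1\}^n \to [0,1]$ be a coverage function and $\epsilon > 0$. Let $I = \{i \in [n] \mid |\hat{c}(\{i\})| \ge \epsilon^2/18\}$. Set $s_\epsilon = \min\{\mathrm{size}(c), |I|^{\log(3/\epsilon)}\}$ and let

$$\mathcal{T}_\epsilon = \{T \subseteq I \mid |\hat{c}(T)| \ge \epsilon^2/(9 s_\epsilon) \text{ and } |T| \le \log(3/\epsilon)\} \cup \{\emptyset\} .$$

Then,

1. $|I| \le 36/\epsilon^2$,

2. There exists $c' \in$ CV such that $c' = \sum_{T \in \mathcal{T}_\epsilon} \alpha'_T \cdot \mathrm{OR}_T$ and $\|c - c'\|_1 \le \epsilon$.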

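Proof. Let $\alpha_T$ for each $T \subseteq [n]$ be the coefficients of the disjunctions of $c$, that is, $c = \sum_{T \subseteq [n]} \alpha_T \cdot \mathrm{OR}_T$ for non-negative $\alpha_T$ satisfying $\sum_{T \subseteq [n]} \alpha_T \le 1$. Using Theorem 3.4 we know that the coverage function $c_I$ (defined as $c_I(x) = \mathbf{E}_{y \sim \{-1,1\}^{\bar{I}}}[c(x_I \circ y)]$) depends only on variables in $I$ and $\|c - c_I\|_1 \le \epsilon/3$. Therefore, $c_I = \sum_{T \subseteq I} \beta_T \cdot \mathrm{OR}_T$ for some constants $\beta_T \ge 0$, $\sum_{T \subseteq I} \beta_T \le 1$. Using Lemma 3.1 we obtain that $|I| \le 36/\epsilon^2$ and thus $s_\epsilon \le \min\{\mathrm{size}(c), (6/\epsilon)^{2\log(3/\epsilon)}\}$.

Let $\mathcal{T}_1 = \{T \subseteq I \mid 0 < \beta_T < \frac{\epsilon}{3 s_\epsilon} \text{ and } |T| \le \log(3/\epsilon)\}$ and $\mathcal{T}_2 = \{T \subseteq I \mid |T| > \log(3/\epsilon)\}$. Consider any $T \subseteq I$ such that $T \notin \mathcal{T}_1 \cup \mathcal{T}_2$. Then $|T| \le \log(3/\epsilon)$ and $\beta_T \ge \epsilon/(3 s_\epsilon)$. This, by Lemma 3.7 applied to $c_I$, implies that $|\hat{c_I}(T)| \ge \epsilon^2/(9 s_\epsilon)$. By Lemma 3.3, $|\hat{c}(T)| = |\hat{c_I}(T)| \ge \epsilon^2/(9 s_\epsilon)$. This implies that $T \in \mathcal{T}_\epsilon$ and thus every $T \subseteq I$ is in $\mathcal{T}_\epsilon \cup \mathcal{T}_1 \cup \mathcal{T}_2$.

Set $\nu = \sum_{T \in \mathcal{T}_2} \beta_T$ and let $c' = \sum_{T \in \mathcal{T}_\epsilon} \beta_T \mathrm{OR}_T(x) + \nu$. Clearly $c' \in$ CV and $c' = \sum_{T \in \mathcal{T}_\epsilon} \alpha'_T \cdot \mathrm{OR}_T$ for some coefficients $\alpha'_T$. For $c'$ we have:

$$\|c_I - c'\|_1 = \mathbf{E}\left[ \left| \sum_{T \subseteq I} \beta_T \cdot \mathrm{OR}_T(x) - \Big( \sum_{T \in \mathcal{T}_\epsilon} \beta_T \mathrm{OR}_T(x) + \sum_{T \in \mathcal{T}_2} \beta_T \Big) \right| \right]$$

$$\le \mathbf{E}\left[ \Big| \sum_{T \in \mathcal{T}_1} \beta_T \cdot \mathrm{OR}_T(x) \Big| \right] + \mathbf{E}\left[ \Big| \sum_{T \in \mathcal{T}_2} \beta_T \cdot (\mathrm{OR}_T(x) - 1) \Big| \right]$$

$$\le \sum_{T \in \mathcal{T}_1} \mathbf{E}\left[ |\beta_T \mathrm{OR}_T(x)| \right] + \sum_{T \in \mathcal{T}_2} \beta_T \cdot \Pr[\mathrm{OR}_T(x) = 0] .$$

By Theorem 3.4, $\mathrm{size}(c_I) \le \mathrm{size}(c)$ and therefore $|\mathcal{T}_1| \le s_\epsilon$. For each $T \in \mathcal{T}_2$, $|T| > \log(3/\epsilon)$ which gives $\Pr[\mathrm{OR}_T(x) = 0] \le \epsilon/3$. Therefore,

$$\|c_I - c'\|_1 \le |\mathcal{T}_1| \cdot \frac{\epsilon}{3 s_\epsilon} + \frac{\epsilon}{3} \cdot 1 \le \frac{2\epsilon}{3} .$$

Thus, $\|c - c'\|_1 \le \|c - c_I\|_1 + \|c_I - c'\|_1 \le \epsilon$. □

We can now describe and analyze our Proper PAC learning algorithm for CV.

Algorithm 2 Proper PAC Learning of Coverage Functions

1: Set $\theta = \epsilon^2/108$, $s_\epsilon = \min\{\mathrm{size}(c), (12/\epsilon)^{2\lceil \log(6/\epsilon) \rceil}\}$.
2: Draw a random sample of size $m_1 = O(\log(n)/\epsilon^4)$ and use it to estimate $\hat{c}(\{i\})$ for all $i$.
3: Set $\tilde{I} = \{i \in [n] \mid |\tilde{c}(\{i\})| \ge \theta\}$.
4: $\mathcal{S} \leftarrow \{\emptyset\}$.
5: Draw random sample $R$ of size $m_2 = O(s_\epsilon^2 \log(s_\epsilon/\epsilon)/\epsilon^4)$.
6: for $t = 1$ to $\lceil \log(6/\epsilon) \rceil$ do
7:   for each set $T \in \mathcal{S}$ of size $t - 1$ and $i \in \tilde{I} \setminus T$ do
8:     Use $R$ to estimate the coefficient $\hat{c}(T \cup \{i\})$.
9:     If $|\tilde{c}(T \cup \{i\})| \ge \theta$ then $\mathcal{S} \leftarrow \mathcal{S} \cup \{T \cup \{i\}\}$
10:   end for
11: end for
12: Draw a random sample $R$ of size $m_3 = O(s_\epsilon/\epsilon^4)$ and use LAE-LP to minimize $\sum_{(x,y) \in R} |y - \sum_{S \in \mathcal{S}} \alpha_S \cdot \mathrm{OR}_S(x)|$ subject to $\sum_{S \in \mathcal{S}} \alpha_S \le 1$ and $\alpha_S \ge 0$ for all $S \in \mathcal{S}$. Let $\alpha^*_S$ for each $S \in \mathcal{S}$ be the solution.
13: return $\sum_{S \in \mathcal{S}} \alpha^*_S \mathrm{OR}_S$.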


Theorem 3.9 (Thm. 1.6 restated. Proper PAC Learning). There exists an algorithm that, for any $\epsilon > 0$, given random and uniform examples of any $c \in$ CV, with probability at least 2/3, outputs $h \in$ CV such that $\|h - c\|_1 \le \epsilon$. Further, $\mathrm{size}(h) = O(s_\epsilon/\epsilon^2)$ and the algorithm runs in time $\tilde{O}(n) \cdot \mathrm{poly}(s_\epsilon/\epsilon)$ and uses $\log(n) \cdot \tilde{O}(s_\epsilon^2/\epsilon^4)$ random examples, where $s_\epsilon = \min\{\mathrm{size}(c), (12/\epsilon)^{2\lceil \log(6/\epsilon) \rceil}\}$.

Proof. We first describe the algorithm and then present the analysis. We break the description of the algorithm into three stages. The first two stages are similar to those of Algorithm 1.

Let $\theta = \epsilon^2/108$. In the first stage, our algorithm finds a set $\tilde{I}$ of variables that contains $I = \{i \in [n] \mid |\hat{c}(\{i\})| \ge (\epsilon/2)^2/18\}$. We do this by estimating all the singleton Fourier coefficients $\{\hat{c}(\{i\}) \mid i \in [n]\}$ within $\theta/2$ with (overall) probability at least 8/9. As before, we denote the estimate of $\hat{c}(S)$ by $\tilde{c}(S)$ for any $S$. We then set $\tilde{I} = \{i \in [n] \mid |\tilde{c}(\{i\})| \ge \theta\}$. If all the estimates are within $\theta/2$ of the corresponding coefficients then for every $i \in I$, $|\tilde{c}(\{i\})| \ge \epsilon^2/72 - \theta/2 = \epsilon^2/108 = \theta$. Therefore $i \in \tilde{I}$ and hence $I \subseteq \tilde{I}$.

In the second stage, the algorithm finds a set $\mathcal{S} \subseteq 2^{\tilde{I}}$ such that the set of all large Fourier coefficients $\mathcal{T}$ is included in $\mathcal{S}$. Just as in Theorem 3.5, this is done iteratively starting with $\mathcal{S} = \{\emptyset\}$. In every iteration, for every set $S$ that was added in the previous iteration and every $i \in \tilde{I} \setminus S$, it estimates $\hat{c}(S \cup \{i\})$ within $\epsilon^2/(108 s_\epsilon)$ (the success probability for all estimates in this phase will be 8/9). If $|\tilde{c}(S \cup \{i\})| \ge \epsilon^2/(54 s_\epsilon)$ then $S \cup \{i\}$ is added to $\mathcal{S}$. The iterative process is run for at most $\lceil \log(6/\epsilon) \rceil$ iterations.

Finally, the algorithm draws a random sample $R$ of size $m_3$ and uses LAE-LP (Theorem 2.4) to minimize $\sum_{(x,y) \in R} |y - \sum_{S \in \mathcal{S}} \alpha_S \cdot \mathrm{OR}_S(x)|$ subject to $\sum_{S \in \mathcal{S}} \alpha_S \le 1$ and $\alpha_S \ge 0$ for all $S \in \mathcal{S}$. Let $\alpha^*_S$ for each $S \in \mathcal{S}$ be the solution. Here $m_3$ is chosen so that, with probability at least 8/9, $\mathbf{E}[|c(x) - \sum_{S \in \mathcal{S}} \alpha^*_S \cdot \mathrm{OR}_S(x)|]$ (the true error of $\sum_{S \in \mathcal{S}} \alpha^*_S \cdot \mathrm{OR}_S(x)$) is within $\epsilon/2$ of the optimum. Standard uniform convergence bounds imply that $m_3 = O(|\mathcal{S}|/\epsilon^2)$ examples suffice. The algorithm returns $\sum_{S \in \mathcal{S}} \alpha^*_S \cdot \mathrm{OR}_S$ as the hypothesis.

We can now prove the correctness of the algorithm assuming that all the estimates are successful. By an argument similar to the one presented in the proof of Theorem 3.5, we can verify that $I \subseteq \tilde{I}$ and that $\mathcal{T} \subseteq \mathcal{S}$. Using Lemma 3.8 and the facts that $\mathcal{S} \supseteq \mathcal{T}$ and $\tilde{I} \supseteq I$, we obtain that there must exist non-negative $\alpha_S$ with $\sum_{S \in \mathcal{S}} \alpha_S \le 1$ such that $\mathbf{E}[|c(x) - \sum_{S \in \mathcal{S}} \alpha_S \cdot \mathrm{OR}_S(x)|] \le \epsilon/2$. Thus LAE-LP in the third stage will find coefficients $\alpha^*_S$ for each $S \in \mathcal{S}$ such that $\mathbf{E}[|c(x) - \sum_{S \in \mathcal{S}} \alpha^*_S \cdot \mathrm{OR}_S(x)|] \le \epsilon$.

We now analyze the running time and sample complexity of the algorithm. The analysis for the first two stages is similar to the one presented in Theorem 3.5.

• Just as in the proof of Theorem 3.5, all the singleton coefficients can be estimated in time $O(n \log(n)/\epsilon^4)$ and samples $O(\log(n)/\epsilon^4)$ with confidence at least 8/9.

• For every $i$ such that $\hat{c}(\{i\})$ was estimated within $\theta/2$ and $|\tilde{c}(\{i\})| \ge \theta$, we have that $|\hat{c}(\{i\})| \ge \theta/2$. This implies that $|\tilde{I}| \le 2/(\theta/2) = O(1/\epsilon^2)$. Similarly, for every $S \subseteq \tilde{I}$ such that $\hat{c}(S)$ was estimated within $\epsilon^2/(108 s_\epsilon)$ and $|\tilde{c}(S)| \ge \epsilon^2/(54 s_\epsilon)$, we have that $|\hat{c}(S)| \ge \epsilon^2/(108 s_\epsilon)$. Thus using Lemma 3.1, $|\mathcal{S}| = O(s_\epsilon/\epsilon^2)$.

• In the second stage the algorithm only estimates coefficients with indices that are subsets of

$$\mathcal{S}' = \{S \cup \{i\} \mid |\tilde{c}(S)| \ge \epsilon^2/(54 s_\epsilon) \text{ and } i \in \tilde{I} \text{ and } |S| \le \log(6/\epsilon)\} .$$

We can conclude that $|\mathcal{S}'| = O(s_\epsilon/\epsilon^4)$ and, as in the proof of Theorem 3.5, the estimation succeeds with probability at least 8/9 using $\tilde{O}(s_\epsilon^2/\epsilon^4)$ examples and running in time $\tilde{O}(s_\epsilon^3/\epsilon^8)$.

• Finally, using Theorem 2.4, LAE-LP will require $O(|\mathcal{S}|/\epsilon^2) = O(s_\epsilon/\epsilon^4)$ random examples and runs in time polynomial in $|\mathcal{S}|$ and $1/\epsilon$, that is $\mathrm{poly}(s_\epsilon/\epsilon)$.

Overall, the algorithm succeeds with probability at least 2/3, runs in time $\tilde{O}(n) \cdot \mathrm{poly}(s_\epsilon/\epsilon)$ and uses $\log(n) \cdot \tilde{O}(s_\epsilon^2/\epsilon^4)$ random examples. □

It can also be easily seen from our analysis that the necessary estimates of Fourier coefficients can be obtained using statistical queries of tolerance $\epsilon^2/(108 s_\epsilon)$. In addition, LAE-LP can be solved in time polynomial in $s_\epsilon$ and $1/\epsilon$ using statistical queries of tolerance lower bounded by the inverse of a polynomial in $s_\epsilon/\epsilon$ (we are not aware of an explicit reference but the modified Perceptron algorithm by Blum et al. [BFKV97] or the algorithm of Dunagan and Vempala [DV08] can be easily adapted to this problem). This gives us the following corollary.

Corollary 3.10. There exists an algorithm that, for any $\epsilon > 0$, given access to a statistical query oracle for any $c \in$ CV, outputs $h \in$ CV such that $\mathbf{E}_{\mathcal{U}}[|h(x) - c(x)|] \le \epsilon$. The algorithm runs in time $\mathrm{poly}(n, s_\epsilon, 1/\epsilon)$, uses $O(n) + \mathrm{poly}(s_\epsilon/\epsilon)$ statistical queries of tolerance lower bounded by $\frac{1}{\mathrm{poly}(s_\epsilon/\epsilon)}$ and outputs a coverage function of size $O(s_\epsilon/\epsilon^2)$, where $s_\epsilon = \min\{\mathrm{size}(c), (12/\epsilon)^{2\lceil \log(6/\epsilon) \rceil}\}$.

4 Agnostic Learning 

In this section, we show that coverage functions can be approximated in $\ell_1$-norm by a non-negative linear
combination of "short" monotone disjunctions. This will lead to a simple proper agnostic learning algorithm 
for the class of coverage functions on the uniform distribution. We then give a reduction from the problem 
of learning sparse parities with noise which implies that substantial improvement to our algorithm would 
require a breakthrough for the problem of learning sparse parities with noise. 

4.1 Proper Agnostic Learning over the Uniform Distribution 

Our approximation is based on truncating the expansion of a coverage function in terms of monotone disjunctions (Lemma 2.1) to keep only the terms corresponding to short disjunctions. We show that such a truncation is enough to approximate the function with respect to $\ell_1$ error. Recall that by Lemma 2.1 such a truncation is itself a coverage function.

Theorem 4.1. Let $c(x) = \sum_{S \subseteq [n]} \alpha_S \cdot \mathrm{OR}_S(x)$ be a coverage function with range in $[0,1]$ and $\epsilon > 0$. Then, for $k = \lceil \log(1/\epsilon) \rceil$ the coverage function $c' = \sum_{|S| > k} \alpha_S + \sum_{|S| \le k} \alpha_S \cdot \mathrm{OR}_S$ satisfies $\|c - c'\|_1 \le \epsilon$.

Proof. By Lemma 2.1 and the fact that $c$ is a coverage function we obtain that $c'$ is a non-negative combination of monotone disjunctions with the sum of coefficients being at most 1, that is, a coverage function itself. For each $x \in \{-1,1\}^n$, non-negativity of the coefficients $\alpha_S$ implies $c(x) \le c'(x)$. Now,

$$\|c - c'\|_1 = \mathbf{E}[|c(x) - c'(x)|] = \mathbf{E}[c'(x) - c(x)] = \sum_{S \subseteq [n], |S| > k} \mathbf{E}[\alpha_S - \alpha_S \cdot \mathrm{OR}_S] = \sum_{S \subseteq [n], |S| > k} \Pr[\mathrm{OR}_S = 0] \cdot \alpha_S \le \sum_{S \subseteq [n], |S| > k} \alpha_S \cdot \epsilon \le \epsilon .$$

□

As an immediate corollary of Theorem 4.1 and Theorem 2.4 we obtain an algorithm for agnostic learning of coverage functions on the uniform distribution.

Theorem 4.2 (Proper Agnostic Learning of Coverage Functions). There exists an algorithm that agnostically learns the class CV in time $n^{O(\log(1/\epsilon))}$. Further, the hypothesis returned by the algorithm is itself a coverage function.

Proof. Theorem 4.1 shows that every coverage function can be approximated by a non-negative linear combination of monotone disjunctions of length $\log(1/\epsilon)$ within $\epsilon$ in the $\ell_1$-norm. Now, Theorem 2.4 immediately yields an agnostic learning algorithm. □
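
The truncation of Theorem 4.1 that underlies this algorithm is simple enough to check by brute force. In the sketch below the coverage function `alpha`, the values $n = 6$ and $\epsilon = 0.25$, base-2 logarithms, and the helper names are all illustrative assumptions.

```python
from itertools import product
import math

def or_s(S, x):
    """Monotone disjunction; x_i = -1 encodes 'true'."""
    return 1.0 if any(x[i] == -1 for i in S) else 0.0

def truncate(alpha, eps):
    """Replace every disjunction longer than k = ceil(log2(1/eps)) by the constant 1."""
    k = math.ceil(math.log2(1 / eps))
    constant = sum(a for S, a in alpha.items() if len(S) > k)
    short = {S: a for S, a in alpha.items() if len(S) <= k}
    return constant, short

def l1_distance(alpha, constant, short, n):
    """Exact E|c(x) - c'(x)| over the uniform distribution on {-1,1}^n."""
    points = list(product((-1, 1), repeat=n))
    total = 0.0
    for x in points:
        c = sum(a * or_s(S, x) for S, a in alpha.items())
        c_prime = constant + sum(a * or_s(S, x) for S, a in short.items())
        total += abs(c - c_prime)
    return total / len(points)

n, eps = 6, 0.25
alpha = {(0,): 0.3, (1, 2): 0.2, (0, 1, 2, 3): 0.25, (1, 2, 3, 4, 5): 0.25}
constant, short = truncate(alpha, eps)
err = l1_distance(alpha, constant, short, n)
```

Here $k = 2$, the two long disjunctions contribute the constant $0.5$, and the error equals $0.25 \cdot 2^{-4} + 0.25 \cdot 2^{-5}$, matching the sum $\sum_{|S| > k} \Pr[\mathrm{OR}_S = 0] \cdot \alpha_S$ from the proof.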

4.2 Computational Lower Bounds for Agnostic Learning of Coverage Functions 

We now remark that any algorithm that agnostically learns the class of coverage functions on $n$ inputs in time $n^{o(\log(1/\epsilon))}$ would yield a faster algorithm for the notoriously hard problem of Learning Sparse Parities with Noise (SLPN). The reduction only uses the fact that coverage functions include all monotone disjunctions and is based on a technique implicit in [KKMS08, Fel12].

We say that random examples of a Boolean function $f$ have noise of rate $\eta$ if the label of a random example equals $f(x)$ with probability $1 - \eta$ and $1 - f(x)$ with probability $\eta$.

Problem (Learning Sparse Parities with Noise). For $\eta \in (0, 1/2)$ and $k \le n$ the problem of learning $k$-sparse parities with noise $\eta$ is the problem of finding (with probability at least 2/3) the set $S \subseteq [n]$, $|S| \le k$, given access to random examples with noise of rate $\eta$ of parity function $\chi_S$.

The fastest known algorithm for learning $k$-sparse parities with noise $\eta$ is a recent breakthrough result of Valiant which runs in time $O(n^{0.8k} \mathrm{poly}(\frac{1}{1-2\eta}))$ [Val12].

Kalai et al. [KKMS08] and Feldman [Fel12] prove hardness of agnostic learning of majorities and conjunctions, respectively, based on correlation of concepts in these classes with parities. We state below this general relationship between correlation with parities and reduction to SLPN, a simple proof of which appears in [FKV13].

Lemma 4.3. Let $\mathcal{C}$ be a class of functions mapping $\{-1,1\}^n$ into $[-1,1]$. Suppose there exist $\gamma > 0$ and $k \in \mathbb{N}$ such that for every $S \subseteq [n]$, $|S| \le k$, there exists a function $f_S \in \mathcal{C}$ such that $|\hat{f_S}(S)| \ge \gamma(k)$. If there exists an algorithm $A$ that learns the class $\mathcal{C}$ agnostically to accuracy $\epsilon$ in time $T(n, \frac{1}{\epsilon})$ then there exists an algorithm $A'$ that learns $k$-sparse parities with noise $\eta < 1/2$ in time $\mathrm{poly}(n, \frac{1}{(1-2\eta)\gamma(k)}) + 2T(n, \frac{2}{(1-2\eta)\gamma(k)})$.

The correlation between a disjunction and a parity is easy to estimate. 

Fact 4.4. For any $S \subseteq [n]$, $|\langle \mathrm{OR}_S, \chi_S \rangle| = \frac{1}{2^{|S|-1}}$.

We thus immediately obtain the following simple corollary. 

Theorem 4.5. Suppose there exists an algorithm that learns the class of Boolean disjunctions over the uniform distribution agnostically to an accuracy of $\epsilon > 0$ in time $T(n, \frac{1}{\epsilon})$. Then there exists an algorithm that learns $k$-sparse parities with noise $\eta < \frac{1}{2}$ in time $\mathrm{poly}(n, \frac{2^{k-1}}{1-2\eta}) + 2T(n, \frac{2^k}{1-2\eta})$. In particular, if $T(n, \frac{1}{\epsilon}) = n^{o(\log(1/\epsilon))}$ then there exists an algorithm to solve $k$-SLPN in time $n^{o(k)}$.

Thus, any algorithm that is asymptotically faster than the one from Theorem 4.2 yields a faster algorithm for $k$-SLPN.

5 Application to Privately Releasing Monotone Disjunctions 

In this section we use our PAC learning algorithms to derive improved private algorithms for releasing 
monotone disjunction (or conjunction) counting queries. We begin with the necessary formal definitions. 
Differential Privacy: We use the standard formal notion of privacy, referred to as differential privacy, proposed by Dwork et al. [DMNS06]. For some domain $X$, we will call $D \subseteq X$ a database. Databases $D, D' \subset X$ are adjacent if one can be obtained from the other by adding a single item. In this paper, we will focus on Boolean databases, thus $X \subseteq \{-1,1\}^n$ for $n \in \mathbb{N}$. We now define a differentially private algorithm. In the following, $A$ is an algorithm that takes as input a database $D$ and outputs an element of some set $R$.

Definition 5.1 (Differential privacy [DMNS06]). A (randomized) algorithm $A : 2^X \to R$ is $\epsilon$-differentially private if for all $r \in R$ and every pair of adjacent databases $D, D'$, we have $\Pr[A(D) = r] \le e^{\epsilon} \Pr[A(D') = r]$.

Private Counting Query Release: We are interested in algorithms that answer predicate counting queries on Boolean databases. A predicate counting query finds the fraction of elements in a given database that satisfy the predicate. More generally, given a query $c : \{-1,1\}^n \to [0,1]$, a counting query corresponding to $c$ on a database $D \subseteq \{-1,1\}^n$ of size $m := |D|$ expects in reply $q_c(D) = \frac{1}{m} \sum_{r \in D} c(r)$. In our applications, we will only insist on answering the counting queries approximately, that is, for some tolerance $\tau > 0$, an approximate counting query in the setting above expects a value $v$ that satisfies $|v - \frac{1}{m} \sum_{r \in D} c(r)| \le \tau$. A class of queries $\mathcal{C}$ mapping $\{-1,1\}^n$ into $[0,1]$ thus induces a counting query function $\mathrm{CQ}_D : \mathcal{C} \to [0,1]$ given by $\mathrm{CQ}_D(c) = q_c(D)$ for every $c \in \mathcal{C}$. For a class $\mathcal{C}$ of such functions and a database $D$, the goal of a data release algorithm is to output a summary $H : \mathcal{C} \to [0,1]$ that provides (approximate) answers to queries in $\mathcal{C}$. A private data release algorithm additionally requires that $H$ be produced in a differentially private way with respect to the participants in the database. One very useful way of publishing a summary is to output a synthetic database $\hat{D} \subseteq \{-1,1\}^n$ such that for any query $c \in \mathcal{C}$, $q_c(\hat{D})$ is a good approximation for $q_c(D)$. Synthetic databases are an attractive method for publishing private summaries as they can be directly used in software applications that are designed to run on Boolean databases, in addition to being easily understood by humans.

For a class $\mathcal{C}$ of queries mapping $\{-1,1\}^n$ into $[0,1]$, and a distribution $\Pi$ on $\mathcal{C}$, an algorithm $A$ $(\alpha, \beta)$-answers queries from $\mathcal{C}$ over a database $D$ on the distribution $\Pi$ if for $H = A(D)$, $\Pr_{f \sim \Pi}[|\mathrm{CQ}_D(f) - H(f)| \le \alpha] \ge 1 - \beta$. For convenience we will only measure the average error $\bar{\alpha}$ and require that $\mathbf{E}_{f \sim \Pi}[|\mathrm{CQ}_D(f) - H(f)|] \le \bar{\alpha}$. Clearly, one can obtain an $(\alpha, \beta)$-query release algorithm from an $\bar{\alpha}$-average error query release algorithm by setting $\bar{\alpha} = \alpha \cdot \beta$. The average error is exactly the $\ell_1$-error in approximation of $\mathrm{CQ}_D$ over distribution $\Pi$.

We will need the following proposition that Gupta et al. [GHRU11] prove using a technique from [BDMN05] to relate the problem of privately releasing counting queries on functions from a class $\mathcal{C}$ to the problem of learning a related class of functions by tolerant counting queries.

Proposition 5.1 ([GHRU11]). Let $A$ denote an algorithm that uses $q$ counting queries of tolerance $\tau$ in its computation. Then for every $\epsilon, \delta > 0$, with probability $1 - \delta$, $A$ can be simulated in an $\epsilon$-differentially private way provided that the size of database $|D| \ge q(\log q + \log(1/\delta))/(\epsilon \cdot \tau)$. Simulation of each query of $A$ takes time $O(|D|)$.

The key observation for using our algorithm (and also used in prior work [GHRU11, CKKL12, HRS12, TUV12]) is that for any database $D$, $\mathrm{CQ}_D$ can be seen as a function on $\{-1,1\}^n$ and is an average of monotone disjunctions, that is, a coverage function. We include the simple proof for completeness.

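Lemma 5.2. For $x \in \{-1,1\}^n$ we associate $x$ with $S_x \subseteq [n]$ such that $x_i = -1$ iff $i \in S_x$. For database $D$, let $c_D : \{-1,1\}^n \to [0,1]$ be defined as $c_D(x) = \mathrm{CQ}_D(\mathrm{OR}_{S_x})$. Then $c_D$ is a coverage function.

Proof. By definition $\mathrm{CQ}_D(\mathrm{OR}_{S_x}) = \frac{1}{|D|} \sum_{r \in D} \mathrm{OR}_{S_x}(r)$. Note that $\mathrm{OR}_{S_x}(r) = \bigvee_{i \in [n]} r_i \wedge x_i = \mathrm{OR}_{S_r}(x)$. Then,

$$c_D(x) = \frac{1}{|D|} \sum_{r \in D} \mathrm{OR}_{S_r}(x) .$$

□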
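
The symmetry $\mathrm{OR}_{S_x}(r) = \mathrm{OR}_{S_r}(x)$ used in this proof can be confirmed on a toy database. The database `D` and the helper names below are hypothetical; the sketch compares the counting-query view and the coverage-function view on every point of $\{-1,1\}^3$.

```python
from itertools import product

def or_s(S, x):
    """Monotone disjunction; x_i = -1 encodes 'true'."""
    return 1 if any(x[i] == -1 for i in S) else 0

def support(x):
    """S_x = {i : x_i = -1}."""
    return frozenset(i for i, v in enumerate(x) if v == -1)

def counting_query(D, S):
    """Fraction of rows r in D that satisfy the monotone disjunction OR_S."""
    return sum(or_s(S, r) for r in D) / len(D)

def c_D(D, x):
    """Coverage-function view: average over rows r of OR_{S_r} evaluated at x."""
    return sum(or_s(support(r), x) for r in D) / len(D)

# Toy Boolean database with four rows over three attributes (illustrative only).
D = [(-1, 1, 1), (1, -1, -1), (1, 1, 1), (-1, -1, 1)]
agree = all(counting_query(D, support(x)) == c_D(D, x)
            for x in product((-1, 1), repeat=3))
```

Each row $r$ contributes one disjunction $\mathrm{OR}_{S_r}$ with weight $1/|D|$, which is exactly the non-negative combination that makes $c_D$ a coverage function.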



For a class of queries $\mathcal{C}$, let

$$Q(\mathcal{C}) = \{\mathrm{CQ}_D \mid D \subseteq \{-1,1\}^n\}$$

be the class of all counting query functions, one for each database $D$. Each $\mathrm{CQ}_D$ maps $\mathcal{C}$ into $[0,1]$. Lemma 5.2 implies that for the class of monotone disjunctions $\mathcal{C}$, $Q(\mathcal{C})$ is a subset of CV.

We now show that we can implement Algorithm 1 using tolerant counting query access to the database $D$ and thereby obtain a private data release algorithm for monotone disjunctions.

Theorem 5.3 (Thm. 1.7 restated). Let $\mathcal{C}$ be the class of all monotone disjunctions. For every $\epsilon, \delta > 0$, there exists an $\epsilon$-differentially private algorithm which for any database $D \subseteq \{-1,1\}^n$ of size $\tilde{\Omega}(n \log(1/\delta)/(\epsilon \bar{\alpha}^6))$, with probability at least $1 - \delta$, publishes a data structure $H$ that answers queries from $\mathcal{C}$ with respect to the uniform distribution with average error of at most $\bar{\alpha}$. The algorithm runs in time $\tilde{O}(n^2 \log(1/\delta)/(\epsilon \bar{\alpha}^{10}))$ and the size of $H$ is $\log n \cdot \tilde{O}(1/\bar{\alpha}^4)$.

Proof. In Algorithm 1 the random examples of the target coverage function $c$ are used only to estimate Fourier coefficients of $c$ within tolerance $\theta/2$. Thus to implement Algorithm 1 it is sufficient to show that for any index set $T \subseteq [n]$, we can compute $\hat{c_D}(T)$ within $\theta/2$ using tolerant counting query access to $D$. Consider any set $T \subseteq [n]$ and recall that for any $x \in \{-1,1\}^n$, $S_x = \{i \mid x_i = -1\}$. Then, we have

$$\hat{c_D}(T) = \mathbf{E}_{x}[c_D(x) \cdot \chi_T(x)] = \mathbf{E}_{x}\Big[\frac{1}{|D|} \sum_{r \in D} \mathrm{OR}_{S_r}(x) \cdot \chi_T(x)\Big] = \frac{1}{|D|} \sum_{r \in D} \widehat{\mathrm{OR}_{S_r}}(T) . \qquad (1)$$

Define $F_T(r) = (1 + \widehat{\mathrm{OR}_{S_r}}(T))/2$. Now $F_T$ is a function with range $[0,1]$ and from equation (1) above, we observe that $\hat{c_D}(T)$ can be estimated with tolerance $\theta/2$ by making a counting query for $F_T$ on $D$ with tolerance $\theta/4$. We now note that the $\ell_1$-error of hypothesis $h$ over the uniform distribution on $\{-1,1\}^n$ is the same as the average error $\bar{\alpha}$ of answering counting queries using $h$ over the uniform distribution on monotone disjunctions. Therefore $\theta/4 = \bar{\alpha}^2/24$. The number of queries made by the algorithm is exactly equal to the number of Fourier coefficients estimated by it, which is $O(n + 1/\bar{\alpha}^4)$. The output of Algorithm 1 is a linear combination of $O(1/\bar{\alpha}^2)$ parities over a subset of $O(1/\bar{\alpha}^2)$ variables and hence requires $\log n \cdot \tilde{O}(1/\bar{\alpha}^4)$ space. Note that given correct estimates of Fourier coefficients Algorithm 1 is always successful. By applying Proposition 5.1, we can obtain an $\epsilon$-differentially private execution of Algorithm 1 that succeeds with probability at least $1 - \delta$ provided that the database size is

$$\Omega\left( \frac{(n + \bar{\alpha}^{-4})(\log(n + \bar{\alpha}^{-4}) + \log(1/\delta))}{\epsilon \bar{\alpha}^2} \right) = \tilde{\Omega}\big(n \log(1/\delta)/(\epsilon \bar{\alpha}^6)\big) .$$

The running time is dominated by the estimation of Fourier coefficients and hence is $O(n + 1/\bar{\alpha}^4) = O(n/\bar{\alpha}^4)$ times the size of the database. □
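
The reduction from Fourier coefficients of $c_D$ to counting queries for $F_T$ can be illustrated as follows. This is a sketch under assumptions: the counting query is answered exactly (tolerance 0, no privacy noise), the toy database `D` is hypothetical, and `or_hat` uses the closed form for the Fourier coefficients of $\mathrm{OR}_S$ from the proof of Lemma 3.1.

```python
from itertools import product, combinations
import math

def or_hat(S, T):
    """Fourier coefficient of OR_S at T: 1 - 2^{-|S|} at the empty set, -2^{-|S|} for T inside S."""
    if not T:
        return 1 - 2.0 ** -len(S)
    return -(2.0 ** -len(S)) if T <= S else 0.0

def support(r):
    """S_r = {i : r_i = -1}."""
    return frozenset(i for i, v in enumerate(r) if v == -1)

def counting_query(D, F):
    """Exact counting query: average of F over the rows of D."""
    return sum(F(r) for r in D) / len(D)

def coefficient_via_query(D, T):
    """Recover c_D_hat(T) from one counting query for F_T, which has range [0, 1]."""
    F_T = lambda r: (1 + or_hat(support(r), T)) / 2
    return 2 * counting_query(D, F_T) - 1

def coefficient_brute_force(D, T, n):
    """Direct evaluation of E_x[c_D(x) * chi_T(x)] over {-1,1}^n."""
    total = 0.0
    for x in product((-1, 1), repeat=n):
        c_D = sum(1 for r in D
                  if any(r[i] == -1 and x[i] == -1 for i in range(n))) / len(D)
        total += c_D * math.prod(x[i] for i in T)
    return total / 2 ** n

n = 3
D = [(-1, 1, 1), (1, -1, -1), (1, 1, 1), (-1, -1, 1)]
index_sets = [frozenset(T) for k in range(n + 1) for T in combinations(range(n), k)]
max_gap = max(abs(coefficient_via_query(D, T) - coefficient_brute_force(D, T, n))
              for T in index_sets)
```

Since the answer is rescaled by a factor of 2, a counting query of tolerance $\theta/4$ for $F_T$ yields the coefficient within $\theta/2$, as stated in the proof.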

We now use our proper PAC learning algorithm for coverage functions to obtain an algorithm for synthetic 
database release for answering monotone disjunction queries. 

Theorem 5.4 (Thm. 1.8 restated). Let $\mathcal{C}$ be the class of all monotone disjunctions. For every $\epsilon, \delta > 0$, there exists an $\epsilon$-differentially private algorithm which for any database $D \subseteq \{-1,1\}^n$ of size $n \cdot \bar{\alpha}^{-O(\log(1/\bar{\alpha}))} \cdot \log(1/\delta)/\epsilon$, with probability at least $1 - \delta$, releases a synthetic database $\hat{D}$ of size $\bar{\alpha}^{-O(\log(1/\bar{\alpha}))}$ that can answer queries from $\mathcal{C}$ with respect to the uniform distribution with average error of at most $\bar{\alpha}$. The algorithm runs in time $n^2 \cdot \bar{\alpha}^{-O(\log(1/\bar{\alpha}))} \cdot \log(1/\delta)/\epsilon$.

Proof. We will show that we can implement our proper PAC learning algorithm for coverage functions (Algorithm 2) to learn $Q(\mathcal{C})$ with tolerant counting query access to the database $D$. Algorithm 2 returns a hypothesis $H : \{-1,1\}^n \to [0,1]$ given by a non-negative linear combination of monotone disjunctions that $\ell_1$-approximates $c_D$. We will then show that we can construct a database $\hat{D}$ using $H$ which, up to a small discretization error, computes the same function as $H$. As before, an $\epsilon$-differentially private version of this algorithm is then obtained by invoking Proposition 5.1.

Algorithm 2 uses the random examples from the target function to first estimate certain Fourier coefficients
and then to run the LAE-LP (Theorem 2.4) to find the coefficients of the linear combination of
disjunctions. We have already shown that tolerant counting queries on the database can be used to estimate
Fourier coefficients. We now show how to implement the LAE-LP via tolerant counting queries. Recall that
we proved that there exist monotone disjunctions $\mathrm{OR}_{S_1}, \mathrm{OR}_{S_2}, \ldots, \mathrm{OR}_{S_t}$ and non-negative reals $\gamma_i$ for $i \in [t]$
satisfying $\sum_{i \in [t]} \gamma_i \le 1$ such that $\mathbf{E}[|c_D(x) - \sum_{i \in [t]} \gamma_i \cdot \mathrm{OR}_{S_i}(x)|] \le \Delta$, for some $t$ and $\Delta$. The LAE-LP is
run on $\mathrm{poly}(t, 1/\eta)$ random labeled examples, on which it minimizes the average absolute error (subject to
the coefficients being non-negative and summing to at most 1). It returns non-negative coefficients $\gamma_i^*$ for $i \in [t]$
such that $\mathbf{E}[|c_D(x) - \sum_{i \in [t]} \gamma_i^* \cdot \mathrm{OR}_{S_i}(x)|] \le \Delta + \eta$.

We can simulate a random example of the target coverage function $c_D$ by drawing $x \in \{-1,1\}^n$
uniformly at random and making the counting query on the disjunction over $S_x = \{i \mid x_i = -1\}$. Since we
can only use $\tau$-tolerant queries, we are guaranteed that the value we obtain, denoted by $\tilde{c}_D(x)$, satisfies
$|\tilde{c}_D(x) - c_D(x)| \le \tau$. This additional error in the values has average value of at most $\tau$ and hence can cause
the LAE-LP to find a solution whose average absolute error is up to $2\tau$ worse than the average absolute error
of the optimal solution. This means that we are guaranteed that the returned coefficients satisfy
$\mathbf{E}[|c_D(x) - \sum_{i \in [t]} \gamma_i^* \cdot \mathrm{OR}_{S_i}(x)|] \le \Delta + \eta + 2\tau$.
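
The simulation of labeled examples described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the database is a plain list of $\pm 1$ tuples, and the $\tau$-tolerant oracle is modeled by adding uniform noise in $[-\tau, \tau]$, which is only one of many behaviours a tolerant oracle may have.

```python
import random

def or_query(S, row):
    """OR_S(row) = 1 iff row[j] == -1 for some j in S (monotone disjunction)."""
    return 1 if any(row[j] == -1 for j in S) else 0

def counting_query(S, database):
    """Exact value c_D: the fraction of database rows that satisfy OR_S."""
    return sum(or_query(S, row) for row in database) / len(database)

def simulated_example(database, n, tau, rng):
    """Draw x uniformly from {-1,1}^n and label it with a tau-tolerant answer.

    Assumption for illustration: the tolerant oracle perturbs the exact
    answer by uniform noise of magnitude at most tau.
    """
    x = tuple(rng.choice((-1, 1)) for _ in range(n))
    S_x = [i for i in range(n) if x[i] == -1]       # S_x = {i | x_i = -1}
    exact = counting_query(S_x, database)
    noisy = exact + rng.uniform(-tau, tau)          # |noisy - exact| <= tau
    return x, noisy, exact
```

Feeding the pairs `(x, noisy)` to an absolute-error minimizer in place of exactly labeled examples is what costs the additional $2\tau$ in the guarantee.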

We simulate Algorithm 2 with $\ell_1$-error of $\alpha/2$ and let $\tau$ be the minimum of the tolerance required for
estimating Fourier coefficients and $\alpha/8$. The error of the coverage hypothesis function $H$ is then at most
$\alpha/2 + 2\tau \le 3\alpha/4$. Inspecting the proof of Theorem [31] shows that $\tau = \alpha^{O(\log(1/\alpha))}$ suffices. Thus, to
summarize, we obtain that there exists an algorithm that makes $n + \alpha^{-O(\log(1/\alpha))}$ $\tau$-tolerant counting queries
to the database $D$ (for $\tau$ as above) and uses $n \cdot \alpha^{-O(\log(1/\alpha))}$ time to output a non-negative linear combination
$H$ of at most $\alpha^{-O(\log(1/\alpha))}$ monotone disjunctions that satisfies $\|c_D - H\|_1 \le 3\alpha/4$. Applying Proposition 5.1,
we obtain that there exists an $\epsilon$-differentially private algorithm to compute an $H$ as above with the claimed
bounds on the size of $D$ and running time.

We now convert our hypothesis $H(x) = \sum_{i \in [t]} \gamma_i^* \cdot \mathrm{OR}_{S_i}(x)$ into a database by using the converse of
Lemma 5.2. For each $\mathrm{OR}_S$, let $x^S \in \{-1,1\}^n$ be defined by: for all $j \in [n]$, $x^S_j = -1$ if and only if $j \in S$.
Our goal is to construct a database $\hat{D}$ in which each $x^{S_i}$ for $i \in [t]$ has a number of copies that is proportional
to $\gamma_i^*$, since this would imply that $c_{\hat{D}} = H$. To achieve this we first round down each $\gamma_i^*$ to the nearest multiple
of $\alpha/(4t)$ and let $\hat{\gamma}_i$ denote the result. The function $\hat{H}(x) = \sum_{i \le t} \hat{\gamma}_i \cdot \mathrm{OR}_{S_i}(x)$ satisfies $\|H - \hat{H}\|_1 \le \alpha/4$ and
hence $\|\hat{H} - c_D\|_1 \le \alpha$. Now we let $\hat{D}$ be the database in which each $x^{S_i}$ for $i \in [t]$ has $4t\hat{\gamma}_i/\alpha$ copies (a
non-negative integer by our discretization). From Lemma 5.2 we see that $c_{\hat{D}} = \hat{H}$. Note that the size of $\hat{D}$
is at most $4t/\alpha = \alpha^{-O(\log(1/\alpha))}$.
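
The rounding and database construction can be checked on small instances with the following Python sketch. It is illustrative only: the function names are hypothetical, exact rational arithmetic is used to avoid floating-point rounding, and $4t/\alpha$ is assumed to be an integer. The padding with all-$(+1)$ rows is an assumption of this sketch rather than a step stated in the text; such rows satisfy no monotone disjunction, and padding to exactly $4t/\alpha$ rows makes the counting-query function of the database equal the rounded hypothesis even when the rounded coefficients sum to less than 1.

```python
from fractions import Fraction
from itertools import product
from math import floor

def or_value(S, z):
    """OR_S(z) = 1 iff z[j] == -1 for some j in S."""
    return 1 if any(z[j] == -1 for j in S) else 0

def hypothesis_value(sets, weights, x):
    """H(x) = sum_i weights[i] * OR_{S_i}(x)."""
    return sum(w * or_value(S, x) for S, w in zip(sets, weights))

def synthetic_database(sets, gammas, alpha, n):
    """Round each gamma_i down to a multiple of alpha/(4t) and build the rows."""
    t = len(sets)
    size = Fraction(4 * t) / Fraction(alpha)       # 4t/alpha
    assert size.denominator == 1, "sketch assumes 4t/alpha is an integer"
    size = int(size)
    rows, rounded = [], []
    for S, g in zip(sets, gammas):
        copies = floor(Fraction(g) * size)         # 4t * gamma_hat_i / alpha
        rounded.append(Fraction(copies, size))     # gamma_hat_i
        x_S = tuple(-1 if j in S else 1 for j in range(n))
        rows.extend([x_S] * copies)
    # Assumption of this sketch: pad with all-(+1) rows up to 4t/alpha rows.
    rows.extend([(1,) * n] * (size - len(rows)))
    return rows, rounded

def counting_query(S, database):
    """Fraction of database rows satisfying OR_S, as an exact rational."""
    return Fraction(sum(or_value(S, z) for z in database), len(database))

def max_gap(database, sets, rounded, n):
    """Largest |c_Dhat(x) - Hhat(x)| over all x in {-1,1}^n (brute force)."""
    gaps = []
    for x in product((-1, 1), repeat=n):
        S_x = [i for i in range(n) if x[i] == -1]
        gaps.append(abs(counting_query(S_x, database)
                        - hypothesis_value(sets, rounded, x)))
    return max(gaps)
```

The key fact used is that the row $x^{S_i}$ satisfies the query on $S_x$ exactly when $S_i \cap S_x \neq \emptyset$, that is, exactly when $\mathrm{OR}_{S_i}(x) = 1$.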

□

6 Distribution-Independent Learning 

6.1 Reduction from Learning Disjoint DNFs 

In this section we show that distribution-independent learning of coverage functions is at least as hard as 
distribution-independent learning of disjoint DNF formulas. 

Theorem 6.1 (restated). Let $A$ be an algorithm that distribution-independently PAC learns the
class of all size-$s$ coverage functions from $\{-1,1\}^n$ to $[0,1]$ in time $T(n, s, 1/\epsilon)$. Then there exists an algorithm
$A'$ that PAC learns the class of $s$-term disjoint DNFs in time $T(2n, s, 2s/\epsilon)$.

Proof. Let $d = \bigvee_{i \le s} T_i$ be a disjoint DNF with $s$ terms. Disjointness of the terms implies that $d(x) = \sum_{i \le s} T_i(x)$
for every $x \in \{-1,1\}^n$. By de Morgan's law, we have $d = s - \sum_{i \le s} D_i$, where each $D_i$ is a disjunction
on the negated literals of $T_i$. We will now use a standard reduction [KLV94] through a one-to-one map
$m : \{-1,1\}^n \to \{-1,1\}^{2n}$ and show that there exists a sum of monotone disjunctions $d'$ on $\{-1,1\}^{2n}$ such
that for every $x \in \{-1,1\}^n$, $d'(m(x)) = s - d(x)$. The mapping $m$ maps $x \in \{-1,1\}^n$ to $y \in \{-1,1\}^{2n}$
such that for each $i \in [n]$, $y_{2i-1} = x_i$ and $y_{2i} = -x_i$. To define $d'$, we modify each disjunction $D_j$ in the
representation of $d$ to obtain a monotone disjunction $D'_j$ and set $d' = \sum_{j \le s} D'_j$. For each $x_i$ that appears
in $D_j$ we include $y_{2i-1}$ in $D'_j$, and for each $\neg x_i$ in $D_j$ we include $y_{2i}$. Thus $D'_j$ is a monotone disjunction on
$y_1, \ldots, y_{2n}$. It is easy to verify that $d(x) = s - d'(m(x))$ for every $x \in \{-1,1\}^n$. Now $d'/s = 1 - d/s$ is a
convex combination of monotone disjunctions, that is, a coverage function.
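
The identity $d(x) = s - d'(m(x))$ can be verified by brute force on a small example. The Python sketch below is only an illustration under stated assumptions: the representation of a term as a list of `(variable, positive)` pairs and all function names are hypothetical, indices are 0-based (so $y_{2i-1}, y_{2i}$ become `y[2*i], y[2*i+1]`), and the value $-1$ is treated as "true", matching the convention that $\mathrm{OR}_S(x) = 1$ iff $x_j = -1$ for some $j \in S$.

```python
from itertools import product

def literal_true(x, i, positive):
    """Literal x_i holds when x[i] == -1; its negation holds when x[i] == 1."""
    return (x[i] == -1) if positive else (x[i] == 1)

def dnf_value(terms, x):
    """Number of satisfied terms; equals d(x) in {0,1} when terms are disjoint."""
    return sum(all(literal_true(x, i, p) for i, p in term) for term in terms)

def m(x):
    """The map m: each x_i is sent to the pair (x_i, -x_i)."""
    y = []
    for xi in x:
        y.extend((xi, -xi))
    return tuple(y)

def monotone_sets(terms):
    """Index set of D'_j: D_j negates every literal of T_j, so a positive
    literal x_i of T_j contributes y[2i+1] and a negated one contributes y[2i]."""
    return [frozenset(2 * i + 1 if p else 2 * i for i, p in term) for term in terms]

def d_prime(sets, y):
    """Sum of monotone disjunctions OR_S(y) over the given index sets."""
    return sum(any(y[k] == -1 for k in S) for S in sets)

def check_reduction(terms, n):
    """Brute-force check of d(x) = s - d'(m(x)) over all of {-1,1}^n."""
    s, sets = len(terms), monotone_sets(terms)
    return all(dnf_value(terms, x) == s - d_prime(sets, m(x))
               for x in product((-1, 1), repeat=n))
```

For instance, the terms $x_0 \wedge x_1$ and $\neg x_0 \wedge x_2$ are disjoint, and `check_reduction([[(0, True), (1, True)], [(0, False), (2, True)]], 3)` exercises all eight inputs.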

We now describe the reduction. As usual, we can assume that the number of terms $s$ in the target
disjoint DNF is known to the algorithm; this assumption can be removed via the
standard "guess-and-double" trick. Given $\epsilon > 0$ and random examples drawn from a distribution $\mathcal{D}$ on $\{-1,1\}^n$ and
labeled by a disjoint DNF $d$, $A'$ converts each such example $(x, y)$ to the example $(m(x), 1 - y/s)$. On
the modified examples, $A'$ runs the algorithm $A$ with error parameter $\epsilon/(2s)$ and obtains a hypothesis $h'$.
Finally, $A'$ returns the hypothesis $h(x) = $ "$s(1 - h'(m(x))) \ge 1/2$" (that is, $h(x) = 1$ if $s(1 - h'(m(x))) \ge 1/2$
and $h(x) = 0$ otherwise).

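To establish the correctness of $A'$ we show that $\Pr_{x \sim \mathcal{D}}[d(x) \ne h(x)] \le \epsilon$. By the definition of $h(x)$ we
have that $h(x) \ne d(x)$ only if $|d(x) - s(1 - h'(m(x)))| \ge 1/2$. Thus, by Markov's inequality and the correctness of $A$, we have

$$\Pr_{x \sim \mathcal{D}}[d(x) \ne h(x)] \le 2\,\mathbf{E}_{x \sim \mathcal{D}}\big[|d(x) - (s - s\,h'(m(x)))|\big] = 2s \cdot \mathbf{E}_{x \sim \mathcal{D}}\big[|h'(m(x)) - (1 - d(x)/s)|\big] \le \epsilon.$$

Finally, the running time of our simulation is dominated by the running time of $A$.

□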

6.2 Reduction to Learning of Thresholds of Monotone Disjunctions 

We give a general reduction of the problem of learning a class $C$ of bounded real-valued functions to the
problem of learning linear thresholds of $C$. It is likely folklore in the context of PAC learning and was employed
by Hardt et al. [HRS12] in a reduction from the problem of private data release. Here we give a slightly more
involved reduction that also works in the agnostic learning setting. The reduction preserves the distribution
on the domain and hence can be applied both for learning with respect to a fixed distribution and in the
distribution-independent setting. We remark that this reduction might make the problem computationally
harder than the original one. For example, while coverage functions are, as we demonstrated, efficiently
learnable over the uniform distribution, linear thresholds of monotone disjunctions appear
to be significantly harder to learn: in particular, they include monotone CNF formulas, which are not known
to be efficiently learnable.

For any $y \in \mathbb{R}$, let $\mathrm{thr}(y) : \mathbb{R} \to \{0,1\}$ be the function that is 1 iff $y \ge 0$. For any class $C$ of functions
mapping $\{-1,1\}^n$ into $[0,1]$, let $C_\ge$ denote the class of Boolean functions $\{\mathrm{thr}(c - \theta) \mid c \in C, \theta \in [0,1]\}$.

Theorem 6.2. Let $C$ be a class of functions mapping $\{-1,1\}^n$ into $[0,1]$. Let $\mathcal{D}$ be any fixed distribution on
$\{-1,1\}^n$ and suppose $C_\ge$ is agnostically learnable on $\mathcal{D}$ in time $T(n, 1/\epsilon)$. Then $C$ is agnostically learnable
on $\mathcal{D}$ with $\ell_1$-error $\epsilon$ in time $O(T(n, 3/\epsilon)/\epsilon)$.

Proof. We will use the algorithm $A$ that agnostically learns $C_\ge$ to obtain an algorithm $A'$ that agnostically
learns $C$. Let $\mathcal{P}$ be a distribution on $\{-1,1\}^n \times [0,1]$ whose marginal distribution on $\{-1,1\}^n$ is $\mathcal{D}$. Let $c^* \in C$
be the function that achieves the optimum error, that is, $\mathbf{E}_{(x,y) \sim \mathcal{P}}[|c^*(x) - y|] = \min_{f \in C} \mathbf{E}_{(x,y) \sim \mathcal{P}}[|f(x) - y|]$.
For any $\theta \in [0,1]$, let $\mathcal{P}_\theta$ denote the distribution on $\{-1,1\}^n \times \{0,1\}$ obtained by taking a random sample
$(x, y)$ from $\mathcal{P}$ and outputting the sample $(x, \mathrm{thr}(y - \theta))$.
Our algorithm learns $C$ as follows.

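1. For each $1 \le i \le \lfloor 1/\epsilon \rfloor = t$ and $\theta = i \cdot \epsilon$, simulate random examples from $\mathcal{P}_\theta$ and use $A$ with an accuracy
of $\epsilon$ to learn a hypothesis $h_i$. Notice that the marginal distribution of $\mathcal{P}_\theta$ on $\{-1,1\}^n$ is $\mathcal{D}$.

2. Return $h = \epsilon \cdot \sum_{i \in [t]} h_i$ as the final hypothesis.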
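
The two steps above translate directly into code. The following Python sketch is a minimal illustration under stated assumptions: `threshold_learner` is a hypothetical callable standing in for the algorithm $A$ (it receives Boolean-labeled examples and an accuracy parameter and returns a hypothesis), examples are given as a finite list rather than through an example oracle, and thr uses the convention thr$(v) = 1$ iff $v \ge 0$.

```python
import math

def thr(v):
    """Threshold function: 1 iff v >= 0 (convention assumed in this sketch)."""
    return 1 if v >= 0 else 0

def learn_via_thresholds(examples, eps, threshold_learner):
    """Reduce learning a [0,1]-valued target to t = floor(1/eps) threshold problems.

    examples: list of (x, y) pairs with y in [0,1].
    threshold_learner: hypothetical stand-in for A; maps a list of
        (x, label) pairs with label in {0,1} and an accuracy to a hypothesis.
    """
    t = math.floor(1 / eps)
    hypotheses = []
    for i in range(1, t + 1):
        theta = i * eps
        # Simulate examples from P_theta by thresholding the real-valued labels.
        relabeled = [(x, thr(y - theta)) for x, y in examples]
        hypotheses.append(threshold_learner(relabeled, eps))

    def h(x):
        # Final hypothesis: eps times the sum of the t Boolean hypotheses.
        return eps * sum(h_i(x) for h_i in hypotheses)

    return h
```

If every call to the learner returned the exact threshold function, `h(x)` would underestimate the label by at most `eps`, which is the discretization error bounded in the analysis.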

To see why $h$ is a good hypothesis for $\mathcal{P}$, first observe that for any $y \in [0,1]$,

$$0 \le y - \epsilon \cdot \sum_{i \in [t]} \mathrm{thr}(y - i \cdot \epsilon) \le \epsilon. \quad (2)$$
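
As a quick sanity check of this inequality, here is a worked instance with illustrative values $y = 0.37$ and $\epsilon = 0.1$ (so $t = 10$); the numbers are chosen for illustration and do not come from the paper.

```latex
\begin{align*}
\mathrm{thr}(0.37 - 0.1\,i) &= 1 \ \text{for } i \in \{1,2,3\}, \qquad
\mathrm{thr}(0.37 - 0.1\,i) = 0 \ \text{for } i \in \{4,\ldots,10\},\\
y - \epsilon \sum_{i \in [t]} \mathrm{thr}(y - i\epsilon) &= 0.37 - 0.1 \cdot 3 = 0.07 \in [0, \epsilon].
\end{align*}
```

In general the sum counts the indices $i \in [t]$ with $i\epsilon \le y$, so the middle quantity is the remainder of $y$ after removing whole multiples of $\epsilon$.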

Let $c_i^* : \{-1,1\}^n \to \{0,1\}$ be defined by $c_i^*(x) = \mathrm{thr}(c^*(x) - i \cdot \epsilon)$ for every $i \in [t]$. Thus, $0 \le c^*(x) - \epsilon \cdot \sum_{i \in [t]} c_i^*(x) \le \epsilon$
for every $x \in \{-1,1\}^n$. Now, since $A$ returns a hypothesis with error of at most $\epsilon$ higher
than the optimum for $C_\ge$ and $c_i^* \in C_\ge$ for every $i$, we have:

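$$\mathbf{E}_{(x,y) \sim \mathcal{P}}\big[|\mathrm{thr}(y - i \cdot \epsilon) - h_i(x)|\big] = \Pr_{(x,\ell) \sim \mathcal{P}_{i\epsilon}}[\ell \ne h_i(x)] \le \Pr_{(x,\ell) \sim \mathcal{P}_{i\epsilon}}[\ell \ne c_i^*(x)] + \epsilon = \Pr_{(x,y) \sim \mathcal{P}}[\mathrm{thr}(y - i \cdot \epsilon) \ne c_i^*(x)] + \epsilon \quad (3)$$

for every $i \in [t]$. Now, for any fixed $(x, y)$, the number of $i \in [t]$ for which $\mathrm{thr}(y - i \cdot \epsilon) \ne c_i^*(x)$ is at most
$\lceil |y - c^*(x)|/\epsilon \rceil$. Thus,

$$\sum_{i \in [t]} \Pr_{(x,y) \sim \mathcal{P}}[\mathrm{thr}(y - i \cdot \epsilon) \ne c_i^*(x)] \le \mathbf{E}_{(x,y) \sim \mathcal{P}}\left[\left\lceil \frac{|y - c^*(x)|}{\epsilon} \right\rceil\right] \le \mathbf{E}_{(x,y) \sim \mathcal{P}}\left[\frac{|y - c^*(x)|}{\epsilon}\right] + 1. \quad (4)$$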
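Now, by equations (2), (3), and (4),

$$\begin{aligned}
\mathbf{E}_{(x,y) \sim \mathcal{P}}\Big[\Big|y - \epsilon \sum_{i \in [t]} h_i(x)\Big|\Big]
&\le \mathbf{E}_{(x,y) \sim \mathcal{P}}\Big[\Big|y - \epsilon \sum_{i \in [t]} \mathrm{thr}(y - i \cdot \epsilon)\Big|\Big] + \epsilon \, \mathbf{E}_{(x,y) \sim \mathcal{P}}\Big[\Big|\sum_{i \in [t]} \mathrm{thr}(y - i \cdot \epsilon) - \sum_{i \in [t]} h_i(x)\Big|\Big] \\
&\le \epsilon + \epsilon \sum_{i \in [t]} \mathbf{E}_{(x,y) \sim \mathcal{P}}\big[|\mathrm{thr}(y - i \cdot \epsilon) - h_i(x)|\big] \\
&\le \epsilon + \epsilon \sum_{i \in [t]} \Big(\Pr_{(x,y) \sim \mathcal{P}}[\mathrm{thr}(y - i \cdot \epsilon) \ne c_i^*(x)] + \epsilon\Big) \\
&\le \epsilon + \epsilon \Big(\mathbf{E}_{(x,y) \sim \mathcal{P}}\Big[\frac{|y - c^*(x)|}{\epsilon}\Big] + 1 + t \cdot \epsilon\Big) \\
&\le \mathbf{E}_{(x,y) \sim \mathcal{P}}\big[|y - c^*(x)|\big] + 3\epsilon.
\end{aligned}$$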

This establishes the correctness of our algorithm when used with $\epsilon/3$ instead of $\epsilon$. Notice that the running
time of the algorithm is dominated by $\lfloor 1/\epsilon \rfloor$ runs of $A$ and thus is at most $O(\frac{1}{\epsilon} \cdot T(n, \frac{1}{\epsilon}))$. This completes the
proof. □

The same reduction clearly also works in the PAC setting (where $\mathbf{E}_{(x,y) \sim \mathcal{P}}[|y - c^*(x)|] = 0$). We state it
below for completeness.

Lemma 6.3. Let $C$ be a class of functions mapping $\{-1,1\}^n$ into $[0,1]$, let $\mathcal{D}$ be any fixed distribution on
$\{-1,1\}^n$ and suppose that $C_\ge$ is PAC learnable on $\mathcal{D}$ in time $T(n, 1/\epsilon)$. Then $C$ is PAC learnable on $\mathcal{D}$ with
$\ell_1$-error $\epsilon$ in time $O(T(n, 3/\epsilon)/\epsilon)$.

Using these results and Lemma 2.1 we can now relate the complexity of learning coverage functions with
$\ell_1$-error on any fixed distribution to the complexity of PAC learning of the class of thresholds of
non-negative sums of monotone disjunctions on the same distribution.

Corollary 6.4. Suppose there exists an algorithm that PAC learns the class $\mathcal{CV}_\ge$ in time $T(n, 1/\epsilon)$ over a
distribution $\mathcal{D}$. Then there exists an algorithm that PAC learns $\mathcal{CV}$ with $\ell_1$-error $\epsilon$ in time $O(T(n, 3/\epsilon)/\epsilon)$ over
$\mathcal{D}$.

6.3 Agnostic Learning 

In this section, we observe that the agnostic learning algorithm for disjunctions running in time $n^{O(\sqrt{n}\log(1/\epsilon))}$
by Kalai et al. [KKMSOS] also implies an agnostic learning algorithm for coverage functions running in the
same time. We are not aware of a better algorithm even for distribution-independent PAC learning.

We start with a well-known bound on the degree of a polynomial that point-wise approximates a boolean
disjunction [NS92, Pat92, KOS04].



Lemma 6.5 ([KOS04]). For every boolean disjunction $d : \{-1,1\}^n \to \{0,1\}$, there exists a polynomial $p$ of
degree $O(\sqrt{n}\log(1/\epsilon))$ such that for all $x \in \{-1,1\}^n$, $|d(x) - p(x)| \le \epsilon$.



This degree bound is known to be tight [KS07]. As in Theorem 4.1, approximation of disjunctions by polynomials
of degree $O(\sqrt{n}\log(1/\epsilon))$ immediately implies approximation of linear combinations of disjunctions
by polynomials of degree $O(\sqrt{n}\log(1/\epsilon))$.

Lemma 6.6. For any coverage function $c : \{-1,1\}^n \to [0,1]$, there exists a polynomial $p$ of degree
$O(\sqrt{n}\log(1/\epsilon))$ such that for every $x \in \{-1,1\}^n$, $|c(x) - p(x)| \le \epsilon$.
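
The step from disjunctions to coverage functions is a one-line averaging argument, spelled out below as a sketch. The symbols $\alpha_S$ (the non-negative coefficients of the disjunction representation of $c$) and $p_S$ (the polynomial that Lemma 6.5 provides for $\mathrm{OR}_S$) are notation introduced here for illustration; $\sum_S \alpha_S \le 1$ holds because this sum is the value of $c$ at the all-$(-1)$ point and $c$ has range $[0,1]$.

```latex
\begin{align*}
c(x) &= \sum_{S \neq \emptyset} \alpha_S \cdot \mathrm{OR}_S(x), \qquad \alpha_S \ge 0, \quad \sum_{S \neq \emptyset} \alpha_S \le 1,\\
p(x) &:= \sum_{S \neq \emptyset} \alpha_S \cdot p_S(x), \qquad \deg p \le \max_S \deg p_S = O\big(\sqrt{n}\log(1/\epsilon)\big),\\
|c(x) - p(x)| &\le \sum_{S \neq \emptyset} \alpha_S \cdot |\mathrm{OR}_S(x) - p_S(x)| \le \epsilon \sum_{S \neq \emptyset} \alpha_S \le \epsilon .
\end{align*}
```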

Using Theorem 2.4 we immediately obtain an agnostic learning algorithm for coverage functions on
arbitrary distributions.

Theorem 6.7. $\mathcal{CV}$ is distribution-independently agnostically learnable with $\ell_1$-error $\epsilon$ in time $n^{O(\sqrt{n}\log(1/\epsilon))}$.
7 Open Problems

While our results essentially solve the problems of PAC and agnostic learning of coverage functions over 
the uniform distribution, several natural questions about the learnability of this natural class of functions 
remain. We list some of them below. 

1. Which other natural distributions can coverage functions be learned on efficiently? 

2. Can coverage functions be learned properly in fully-polynomial time over the uniform distribution? 

3. Can coverage functions be learned in the more stringent PMAC model [BH12], which requires that the
learner output a hypothesis that multiplicatively approximates the target function with high probability?

4. Is distribution-independent learning of coverage functions easier or harder than the well-studied problem
of learning DNF expressions?
A Omitted Proofs

Proof of Lemma 2.1. Suppose $c : \{-1,1\}^n \to \mathbb{R}^+$ is a coverage function. Then there exist a universe $U$
and sets $A_1, A_2, \ldots, A_n$ with an associated weight function $w : U \to \mathbb{R}^+$ such that for any $x \in \{-1,1\}^n$,
$c(x) = \sum_{u \in \bigcup_{i : x_i = -1} A_i} w(u)$. For any $u \in U$, let $S_u = \{j \in [n] \mid u \in A_j\}$. Then $c(x) = \sum_{u \in U} w(u) \cdot \mathrm{OR}_{S_u}(x)$,
since $\mathrm{OR}_{S_u}(x) = 1$ if and only if $x_j = -1$ for some $j \in S_u$. This yields a representation of $c$ as a linear
combination of $|U|$ disjunctions.

For the converse, now let $c$ be any function such that there exist non-negative coefficients $\alpha_S$ for every
$S \subseteq [n]$, $S \ne \emptyset$, such that $c(x) = \sum_{S \subseteq [n], S \ne \emptyset} \alpha_S \cdot \mathrm{OR}_S(x)$. We will now construct a universe $U$ and sets
$A_1, A_2, \ldots, A_n$ with an associated weight function $w : U \to [0,1]$ and show that $c(x) = \sum_{u \in \bigcup_{i : x_i = -1} A_i} w(u)$.
For every non-zero $\alpha_S$, add a new element $u_S$ to each $A_i$ such that $i \in S$ and let $w(u_S) = \alpha_S$. Let
$U = \bigcup_{1 \le i \le n} A_i$. Now, by our construction, $c(x) = \sum_{u_S : \alpha_S \ne 0} \alpha_S \cdot \mathrm{OR}_S(x) = \sum_{u \in \bigcup_{i : x_i = -1} A_i} w(u)$, as required. □
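
Both directions of this proof are constructive and can be checked by brute force on a small instance. The Python sketch below is illustrative only: the function names and the dictionary-based representations are hypothetical, indices are 0-based, and elements of the universe that belong to no $A_j$ are skipped since they are never covered.

```python
from itertools import product

def coverage_value(sets_A, w, x):
    """c(x): total weight of the union of the A_i with x_i == -1."""
    covered = set()
    for i, xi in enumerate(x):
        if xi == -1:
            covered |= sets_A[i]
    return sum(w[u] for u in covered)

def to_disjunctions(sets_A, w):
    """Coefficients alpha_S with S = S_u = {j : u in A_j}; elements with
    the same S_u are merged into one coefficient."""
    alpha = {}
    for u, weight in w.items():
        S = frozenset(j for j, A in enumerate(sets_A) if u in A)
        if S:  # elements in no A_j contribute nothing
            alpha[S] = alpha.get(S, 0) + weight
    return alpha

def disjunction_value(alpha, x):
    """sum_S alpha_S * OR_S(x), where OR_S(x) = 1 iff x_j == -1 for some j in S."""
    return sum(a for S, a in alpha.items() if any(x[j] == -1 for j in S))

def to_coverage(alpha, n):
    """Converse: one new element u_S of weight alpha_S, added to A_i for each i in S."""
    sets_A, w = [set() for _ in range(n)], {}
    for S, a in alpha.items():
        if a != 0:
            u_S = ("u", tuple(sorted(S)))
            w[u_S] = a
            for i in S:
                sets_A[i].add(u_S)
    return sets_A, w

def representations_agree(sets_A, w, n):
    """Check coverage -> disjunctions -> coverage on every x in {-1,1}^n."""
    alpha = to_disjunctions(sets_A, w)
    back_A, back_w = to_coverage(alpha, n)
    return all(coverage_value(sets_A, w, x) == disjunction_value(alpha, x)
               == coverage_value(back_A, back_w, x)
               for x in product((-1, 1), repeat=n))
```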



Proof of Theorem \2.b\ First, observe that: 



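$$\sum_{T \notin \mathcal{T}} \hat{f}(T)^2 \le \max_{T \notin \mathcal{T}} |\hat{f}(T)| \cdot \sum_{T \notin \mathcal{T}} |\hat{f}(T)| \le \frac{\epsilon}{2L} \sum_{T \subseteq [n]} |\hat{f}(T)| = \frac{\epsilon}{2L} \|\hat{f}\|_1 \le \frac{\epsilon}{2}.$$

Thus $f$ is $\epsilon/2$-concentrated on $\mathcal{T}$. Further, $|\mathcal{T}| \le \frac{2L^2}{\epsilon}$ since for each $T \in \mathcal{T}$, $|\hat{f}(T)| \ge \frac{\epsilon}{2L}$.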
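
The concentration step can be illustrated numerically on a small function. The Python sketch below is an illustration under assumptions, since the theorem statement lies outside this appendix: it takes $L$ to be the spectral norm $\|\hat{f}\|_1$ and keeps the coefficients of magnitude at least $\epsilon/(2L)$. All function names are hypothetical, and the Fourier coefficients are computed by brute force over $\{-1,1\}^n$, which is feasible only for very small $n$.

```python
from itertools import combinations, product
from math import prod

def fourier_coefficients(f, n):
    """Brute-force coefficients f_hat(T) = E_x[f(x) * prod_{i in T} x_i]."""
    points = list(product((-1, 1), repeat=n))
    coeffs = {}
    for k in range(n + 1):
        for T in combinations(range(n), k):
            coeffs[T] = sum(f(x) * prod(x[i] for i in T) for x in points) / len(points)
    return coeffs

def large_coefficients(coeffs, eps):
    """Keep coefficients with |f_hat(T)| >= eps/(2L), L the spectral l1 norm
    (the threshold is an assumption of this sketch)."""
    L = sum(abs(v) for v in coeffs.values())
    kept = {T: v for T, v in coeffs.items() if abs(v) >= eps / (2 * L)}
    return kept, L

def discarded_mass(coeffs, kept):
    """Sum of squared coefficients outside the kept set."""
    return sum(v * v for T, v in coeffs.items() if T not in kept)
```

For any function with $L > 0$, the discarded mass is at most $\epsilon/2$ and the number of kept coefficients is at most $2L^2/\epsilon$, by the same max-times-sum argument as in the display above.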
By Plancherel's theorem, we have: 

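$$\mathbf{E}\Big[\Big(f(x) - \sum_{S \in \mathbb{S}} \tilde{f}(S) \cdot \chi_S(x)\Big)^2\Big] = \sum_{S \in \mathbb{S}} \big(\hat{f}(S) - \tilde{f}(S)\big)^2 + \sum_{S \notin \mathbb{S}} \hat{f}(S)^2.$$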

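The first term satisfies

$$\sum_{S \in \mathbb{S}} \big(\hat{f}(S) - \tilde{f}(S)\big)^2 \le \max_{S \in \mathbb{S}} \{|\hat{f}(S) - \tilde{f}(S)|\} \cdot \sum_{S \in \mathbb{S}} |\hat{f}(S) - \tilde{f}(S)| \le \frac{\epsilon}{6L} \cdot \sum_{S \in \mathbb{S}} |\hat{f}(S) - \tilde{f}(S)|. \quad (5)$$

For each $S \in \mathcal{T}$, $|\hat{f}(S)| \ge \frac{\epsilon}{2L}$ and $|\hat{f}(S) - \tilde{f}(S)| \le \frac{\epsilon}{6L}$. Thus $|\hat{f}(S)| \ge \frac{\epsilon}{2L} \ge |\hat{f}(S) - \tilde{f}(S)|$ and
$\sum_{S \in \mathbb{S}} |\hat{f}(S) - \tilde{f}(S)| \le \|\hat{f}\|_1 = L$. Using equation (5), this gives $\sum_{S \in \mathbb{S}} (\hat{f}(S) - \tilde{f}(S))^2 \le \epsilon/6$.

For the second term, $\sum_{S \notin \mathbb{S}} \hat{f}(S)^2 \le \epsilon/2$, as $\mathbb{S} \supseteq \mathcal{T}$ and $f$ is $\epsilon/2$-concentrated on $\mathcal{T}$. Thus,
$\mathbf{E}[(f(x) - \sum_{S \in \mathbb{S}} \tilde{f}(S) \cdot \chi_S(x))^2] \le \epsilon$.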

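Finally, by the Cauchy-Schwarz inequality:

$$\mathbf{E}\Big[\Big|f(x) - \sum_{S \in \mathbb{S}} \tilde{f}(S) \cdot \chi_S(x)\Big|\Big] \le \sqrt{\mathbf{E}\Big[\Big(f(x) - \sum_{S \in \mathbb{S}} \tilde{f}(S) \cdot \chi_S(x)\Big)^2\Big]} \le \sqrt{\epsilon}.$$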